
html - How to parse an HTML table with lxml based on a list of variables?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 19:18:18

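I am trying to parse an HTML table using lxml. While rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()') gets the results, I want to extract a column's content only if the column begins with a variable from a config file. For example, if a <td> begins with "Street 1", I want to grab the <span> content of that <td> tag. That way I can build a tuple of tuples (which takes care of None values) and then store it in a database.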

lxml_parse.py

import lxml.html as lh

doc=open('test.htm', 'r')
outhtml=lh.parse(doc)
doc.close()

rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')
print(rows)

test.htm

<tr>
<td></td>
<td colspan="2">
Street 1:<span class="required"> *</span><br />
<span class="boldred">2100 5th Ave</span>
</td>
<td colspan="2">
Street 2:<br />
<span class="boldred">Ste 202</span>
</td>
</tr>
<tr>
<td></td>
<td>
City:<span class="required"> *</span><br />
<span class="boldred">NYC</span>
</td>
<td>
State:<br />
<SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
</td>
<td>
Country:<span class="required"> *</span><br />
<SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
</td>
<td>
Zip:<br />
<span class="boldred">10022</span>
</td>
</tr>

Output:

$ python lxml_parse.py 
['2100 5th Ave', 'Ste 202', 'NYC', 'NY', 'USA', '10022']

Parsing against a list of variables is where I run into problems:

import lxml.html as lh

desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']

doc=open('test.htm', 'r')
outhtml=lh.parse(doc)
doc.close()

myresultset = ((var, outhtml.xpath('//tr/td[child::*[text()=var]]/span[@class="boldred"]/text()')) for var in desiredvars)
print myresultset
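A note on the attempt above: the name var inside the quoted XPath string is never replaced by the Python value, and printing a generator expression shows the generator object rather than its results. One possible way around both, as a minimal sketch, is to pass the value as an XPath variable (lxml fills $name from a keyword argument to xpath()) and to build a list instead of a generator. The SAMPLE string, the QUERY constant and the variable name label are assumptions made for this illustration, not part of the original post; the sample is wrapped in <table> so the fragment parses predictably.

```python
import lxml.html as lh

# Shortened stand-in for test.htm (an assumption for this sketch).
SAMPLE = '''<table>
<tr>
<td></td>
<td colspan="2">
Street 1:<span class="required"> *</span><br />
<span class="boldred">2100 5th Ave</span>
</td>
<td>
State:<br />
<SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
</td>
</tr>
</table>'''

desiredvars = ['Street 1', 'State', 'Zip']

doc = lh.fromstring(SAMPLE)

# $label is an XPath variable; lxml substitutes the keyword argument for it.
# text()[1] is the first text node of the cell, i.e. the label before any tag.
QUERY = ('//tr/td[starts-with(normalize-space(text()[1]), $label)]'
         '/span[@class="boldred"]/text()')

# A list comprehension, so the queries run before printing.
myresultset = [(var, doc.xpath(QUERY, label=var)) for var in desiredvars]
print(myresultset)
```

Labels with no matching cell (here 'Zip') come back with an empty list. Because the match is a prefix test, 'Street 1' would also match a label such as 'Street 10:' if the page had one.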

Best answer

The aim is to build this dictionary:

{'City:': 'NYC', 
'Zip:': '10022',
'Street 1:': '2100 5th Ave',
'Country:': 'USA',
'State:': 'NY',
'Street 2:': 'Ste 202'}

You can use this code. After that, it is easy to query the dictionary for the values you want:

import lxml.html as lh

test = '''<tr>
<td></td>
<td colspan="2">
Street 1:<span class="required"> *</span><br />
<span class="boldred">2100 5th Ave</span>
</td>
<td colspan="2">
Street 2:<br />
<span class="boldred">Ste 202</span>
</td>
</tr>
<tr>
<td></td>
<td>
City:<span class="required"> *</span><br />
<span class="boldred">NYC</span>
</td>
<td>
State:<br />
<SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
</td>
<td>
Country:<span class="required"> *</span><br />
<SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
</td>
<td>
Zip:<br />
<span class="boldred">10022</span>
</td>
</tr>'''

outhtml = lh.fromstring(test)
ks = [ k.strip() for k in outhtml.xpath('//tr/td/text()') if k.strip() != '' ]
vs = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')

result = dict( zip(ks,vs) )

print(result)
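The zip() pairing above relies on every labelled cell having exactly one non-empty boldred span. If a cell has none, the keys and values shift out of step. A possible variation, sketched below under that concern, reads the label and the value from the same <td> and then builds the tuple of tuples the question asks for, with None for anything missing. The helper name cell_pairs and the SAMPLE string (which deliberately leaves the Zip cell empty) are hypothetical and not from the original answer.

```python
import lxml.html as lh

# Shortened sample with an empty Zip cell (an assumption for this sketch).
SAMPLE = '''<table>
<tr>
<td></td>
<td>
City:<span class="required"> *</span><br />
<span class="boldred">NYC</span>
</td>
<td>
State:<br />
<SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
</td>
<td>
Zip:<br />
</td>
</tr>
</table>'''

desiredvars = ['Street 1', 'City', 'State', 'Zip']


def cell_pairs(root):
    """Yield (label, value) per labelled <td>; value is None without boldred text."""
    for td in root.xpath('//tr/td'):
        # td.text is the text before the first child element, i.e. the label.
        label = (td.text or '').strip().rstrip(':')
        if not label:
            continue  # skip spacer cells such as <td></td>
        values = td.xpath('span[@class="boldred"]/text()')
        yield label, (values[0] if values else None)


found = dict(cell_pairs(lh.fromstring(SAMPLE)))

# One (name, value) pair per configured variable, None where nothing was found.
rows = tuple((var, found.get(var)) for var in desiredvars)
print(rows)
```

Here the trailing colon is stripped from each label so that the keys match the entries in desiredvars exactly.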

Regarding "html - How to parse an HTML table with lxml based on a list of variables?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/10642513/
