
Python - Beautiful Soup not parsing entire unordered list

Reposted. Author: 行者123. Last updated: 2023-11-28 22:45:20

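I'm trying to scrape a website, but one part of it has me stumped. There is an unordered list of the locations an organization serves, and I can't seem to parse the entire list.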

Here is a sample of the HTML:

<div id="current_tab">

<p class="view_label_type_geoserved" id="view_label_field_geoserved">Geographies Served</p>
<ul>
<li class="view_type_geoserved" id="view_field_geoserved">
<p style="font-weight: bold; border-bottom: 1px dotted #CCC; font-size: .9em;">North Carolina (NC)<span style="float: right; font-size: 0.8em;">North Carolina (NC)</span></p>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Durham (serves entire county)<span style="float: right; font-size: 0.8em;">Durham</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Franklin (serves entire county)<span style="float: right; font-size: 0.8em;">Franklin</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Granville (serves entire county)<span style="float: right; font-size: 0.8em;">Granville</span>
</p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Orange (serves entire county)<span style="float: right; font-size: 0.8em;">Orange</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Person (serves entire county)<span style="float: right; font-size: 0.8em;">Person</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Vance (serves entire county)<span style="float: right; font-size: 0.8em;">Vance</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Wake (serves entire county)<span style="float: right; font-size: 0.8em;">Wake</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Warren (serves entire county)<span style="float: right; font-size: 0.8em;">Warren</span></p>
</li>
</ul>
</div>

Here is what I'm using to parse the elements:

for i in soup.find('div', {'id':'current_tab'}).findAll('p'):
    print(i)

Here is the result I get. Note that it is only the beginning of the list:

<p class="view_label_type_geoserved" id="view_label_field_geoserved">Geographies Served</p>
<p style="font-weight: bold; border-bottom: 1px dotted #CCC; font-size: .9em;">North Carolina (NC)<span style="float: right; font-size: 0.8em;">North Carolina (NC)</span></p>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Durham (serves entire county)<span style="float: right; font-size: 0.8em;">Durham</span></p>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Franklin (serves entire county)<span style="float: right; font-size: 0.8em;">Franklin</span></p>

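Once I get the HTML back, I have functions that use regular expressions to strip out the text and then concatenate the pieces into a single string, but any suggestions on that part would be appreciated as well.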

Best answer

The problem is that the HTML you are dealing with is malformed and needs a lenient parser to handle it.
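The sample contains closing `</li>` tags that have no matching opening tag, and parsers differ in how they recover from that. If you want to check how each parser you have installed copes, here is a minimal sketch. The shortened `SAMPLE_HTML` and the helper name `paragraphs_found` are made up for illustration and are not from the question; what `html.parser` returns may depend on your Beautiful Soup version, so the sketch only reports what it finds.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened, hypothetical version of the question's markup: the first <li>
# is closed properly, then a stray </li> follows each remaining <p>.
SAMPLE_HTML = """
<div id="current_tab">
<p id="label">Geographies Served</p>
<ul>
<li>
<p>North Carolina (NC)</p>
<p>Durham</p>
</li>
<p>Franklin</p>
</li>
<p>Granville</p>
</li>
</ul>
</div>
"""


def paragraphs_found(html, parser):
    """Return the text of every <p> under div#current_tab.

    Returns None when the requested parser library is not installed.
    """
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        return None
    return [p.get_text(strip=True) for p in soup.select("div#current_tab p")]


if __name__ == "__main__":
    for parser in ("html.parser", "lxml", "html5lib"):
        found = paragraphs_found(SAMPLE_HTML, parser)
        if found is None:
            print(parser, "is not installed")
        else:
            print(parser, "found", len(found), "paragraphs:", found)
```

If a parser reports fewer than five paragraphs for this sample, it is dropping content after one of the stray closing tags.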

Use lxml or html5lib:

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html5lib')  # or BeautifulSoup(data, 'lxml')
for p in soup.select('div#current_tab p'):
    print(p.text)

This works for me. It prints:

Geographies Served
North Carolina (NC)North Carolina (NC)
Durham (serves entire county)Durham
Franklin (serves entire county)Franklin
Granville (serves entire county)Granville

Orange (serves entire county)Orange
Person (serves entire county)Person
Vance (serves entire county)Vance
Wake (serves entire county)Wake
Warren (serves entire county)Warren
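
As for stripping the text with regular expressions and joining it into one string: Beautiful Soup can hand back the text directly, so a regex pass may not be needed. A sketch, assuming the short name inside each `<span>` is what you want in the final string. The trimmed, well-formed `SAMPLE_HTML` and the helper names `served_places` and `joined_places` are illustrative and not part of the original post.

```python
from bs4 import BeautifulSoup

# Well-formed, shortened stand-in for the question's list (hypothetical
# sample), so the built-in "html.parser" is enough for this demonstration.
SAMPLE_HTML = """
<div id="current_tab">
<p>Geographies Served</p>
<ul><li>
<p>North Carolina (NC)<span>North Carolina (NC)</span></p>
<p>Durham (serves entire county)<span>Durham</span></p>
<p>Franklin (serves entire county)<span>Franklin</span></p>
</li></ul>
</div>
"""


def served_places(html, parser="html.parser"):
    """Return the short name held in each <span> under div#current_tab."""
    soup = BeautifulSoup(html, parser)
    spans = soup.select("div#current_tab p span")
    return [span.get_text(strip=True) for span in spans]


def joined_places(html, separator=", "):
    """Join the short names into a single string."""
    return separator.join(served_places(html))


print(joined_places(SAMPLE_HTML))
# prints: North Carolina (NC), Durham, Franklin
```

For the real page you would pass `'html5lib'` or `'lxml'` as the `parser` argument of `served_places`, as above, since the real markup is malformed.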

This post is based on a similar question found on Stack Overflow, "Python - Beautiful Soup not parsing entire unordered list": https://stackoverflow.com/questions/28721922/
