
python - Extracting data from HTML content

Reposted. Author: 太空宇宙. Updated: 2023-11-04 10:53:43

I want to download some HTML pages and extract information from them. Each HTML page contains this table tag:

<table class="sobi2Details" style='background-image: url(http://www.imd.ir/components/com_sobi2/images/backgrounds/grey.gif);border-style: solid; border-color: #808080' >
<tr>
<td><h1>Dr Jhon Doe</h1></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td>
<div id="sobi2outer">
<br/>
<span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
<span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
<span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
</div>
</td>
</tr>
</table>

I want to access the name (Jhon), family (Doe) and tel (33727464). I used Beautiful Soup to access these span tags by id:

name=soup.find(id="sobi2Details_field_name").__str__()
family=soup.find(id="sobi2Details_field_family").__str__()
tel=soup.find(id="sobi2Details_field_tel1").__str__()

But I don't know how to extract the data inside these tags. I tried the children and contents attributes, treating the result as a tag:

name = soup.find(id="sobi2Details_field_name")
for child in name.children:
    pass  # process content inside

But I get this error:

'NoneType' object has no attribute 'children'

When I call str() on it as above, it is not None! Any ideas?
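One possible explanation, which the post itself does not confirm: `find()` returns `None` when no element matches, and `None.__str__()` returns the string `'None'`, which is not `None`. That would make the `str` version look fine on a page that lacks the span. A minimal sketch of a guard, using a made-up page without the span:

```python
from bs4 import BeautifulSoup

# Hypothetical page that lacks the span the code is looking for
page_without_span = "<table><tr><td><h1>Dr Jhon Doe</h1></td></tr></table>"
soup = BeautifulSoup(page_without_span, "html.parser")

name = soup.find(id="sobi2Details_field_name")

# str(None) is the text 'None', so the string version hides the failed lookup
name_as_text = name.__str__()

if name is None:
    # find() found no match, so there is nothing to iterate over
    children = []
else:
    children = list(name.children)
```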

Edit: my final solution

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, from_encoding="utf-8")
name_span = soup.find(id="sobi2Details_field_name").__str__()
name = name_span.split(':')[-1]
result = re.sub('</span>', '', name)

Best answer

I found several ways to do it.

from bs4 import BeautifulSoup

# path_to_html_file is the path to one of the downloaded pages
with open(path_to_html_file, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

name_span = soup.find(id="sobi2Details_field_name")

# First way: split the text on ':'
# This only works because there is always a ':' before the target field
name = name_span.text.split(':')[1]

# Second way: iterate over the span's strings
# The string you are looking for is always the last one
name = list(name_span.strings)[-1]

# Third way: walk the 'next' elements
name = name_span.next_element.next_element.next_element  # ugly; worth wrapping in a function :)
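As a rough illustration, the second way can be applied to all three fields from the question. The sketch below runs on a trimmed copy of the question's HTML; `field_value` is a hypothetical helper name, and the `strip()` call is an addition to drop the space that follows some labels.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question
html = """
<div id="sobi2outer">
<span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
<span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
<span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def field_value(soup, field):
    """Return the text that follows the label span (hypothetical helper)."""
    span = soup.find(id="sobi2Details_field_" + field)
    # The value is the last string inside the outer span; strip the
    # whitespace that follows some of the labels.
    return list(span.strings)[-1].strip()


name = field_value(soup, "name")
family = field_value(soup, "family")
tel = field_value(soup, "tel1")
print(name, family, tel)
```

This assumes every page contains all three spans; a page without one of them would make `find()` return `None` and the helper would fail.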

Let me know if that helps.

Regarding python - Extracting data from HTML content, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11701490/
