
python - Web scraping with BeautifulSoup: Assign H4 Header ID to Elements in a List

Reposted. Author: 太空宇宙. Updated: 2023-11-04 11:18:44

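I am doing some web scraping. The page has several h4 tags, each followed by a list. I want to scrape the items of each list and pair every item with the id of its h4 tag. Here is the HTML: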

<h4 class="dataHeaderWithBorder" id="Production" name="production">Production</h4>
<ul class="simpleList">
<li><a href="/company/co0308?ref_=xtco_co_1">Red Claw </a></li>
<li><a href="/company/co0386?ref_=xtco_co_2">Haven </a></li>
<li><a href="/company/co0487?ref_=xtco_co_3">Frame</a></li>
</ul>
<h4 class="dataHeaderWithBorder" id="Distribution" name="Distribution">Distribution</h4>
<ul class="simpleList">
<li><a href="/company/co0017?ref_=xtco_co_1">Broadside Attractions</a> </li>
<li><a href="/company/co0208?ref_=xtco_co_2"> Global Acquisitions</a></li>
</ul>

This is what I want the data to look like:

Production, Red Claw
Production, Haven
Production, Frame
Distribution, Broadside Attractions
Distribution, Global Acquisitions

I can get all the items from both lists, but I cannot get the id. My code looks like this:

    for h4 in soup.find_all('h4', attrs={'class':'dataHeaderWithBorder'}):
        id = h4.get_text()
        #print(id)
        for ul in h4.find_all('ul', attrs={'class':'simpleList'}):
            #print(ul)
            # Find the items that mention a budget
            productionCompany = ul.find_all('a')
            for company in productionCompany:
                text = company.get_text()
                print(id, text)
                productionComps.append(id, text)

I don't know how to get the ID from each h4 tag. If I remove the first two lines and replace h4.find_all with soup.find_all, my output looks like this:

Red Claw
Haven
Frame
Broadside Attractions
Global Acquisitions

Best answer

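Use zip() to pair each h4 with its ul. In the HTML shown, the ul elements are siblings of the h4 tags rather than children, which is why h4.find_all('ul') finds nothing. Pairing by position assumes every header is followed by exactly one list.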

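    # Collect the headers and the lists separately, in document order
    h4_list = soup.find_all('h4', attrs={'class': 'dataHeaderWithBorder'})
    ul_list = soup.find_all('ul', attrs={'class': 'simpleList'})
    productionComps = []
    # zip() pairs the first h4 with the first ul, the second with the second, and so on
    for h4, ul in zip(h4_list, ul_list):
        id_ = h4.get_text()
        productionCompany = ul.find_all('a')
        for company in productionCompany:
            text = company.get_text()
            print(id_, text)
            # append() takes a single argument, so store the pair as a tuple
            productionComps.append((id_, text))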
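
As a possible alternative to pairing by position, here is a minimal sketch that walks from each h4 to the ul that follows it with find_next_sibling() and reads the id attribute directly instead of the header text. It is not part of the original answer. The html string is copied from the question, the names production_comps and header_id are made up for illustration, and the built-in "html.parser" backend is assumed so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<h4 class="dataHeaderWithBorder" id="Production" name="production">Production</h4>
<ul class="simpleList">
<li><a href="/company/co0308?ref_=xtco_co_1">Red Claw </a></li>
<li><a href="/company/co0386?ref_=xtco_co_2">Haven </a></li>
<li><a href="/company/co0487?ref_=xtco_co_3">Frame</a></li>
</ul>
<h4 class="dataHeaderWithBorder" id="Distribution" name="Distribution">Distribution</h4>
<ul class="simpleList">
<li><a href="/company/co0017?ref_=xtco_co_1">Broadside Attractions</a> </li>
<li><a href="/company/co0208?ref_=xtco_co_2"> Global Acquisitions</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

production_comps = []  # hypothetical name: list of (header id, company name) tuples
for h4 in soup.find_all("h4", class_="dataHeaderWithBorder"):
    header_id = h4.get("id")  # the id attribute, e.g. "Production"
    # The list is a sibling of the header, not a child, so search forward among siblings.
    ul = h4.find_next_sibling("ul", class_="simpleList")
    if ul is None:
        continue  # no list after this header
    for link in ul.find_all("a"):
        # strip=True drops the stray spaces around some company names
        production_comps.append((header_id, link.get_text(strip=True)))

for header_id, name in production_comps:
    print(f"{header_id}, {name}")
```

On the sample HTML this prints the five "Production, ..." and "Distribution, ..." lines in the format requested in the question. One caveat: find_next_sibling() returns the first matching ul anywhere after the header, so a header with no list of its own would pick up the list of a later header.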

Regarding "python - Web scraping with BeautifulSoup: Assign H4 Header ID to Elements in a List", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56453506/
