python - BeautifulSoup HTML table parsing: only able to get the last row?

Reposted. Author: 行者123. Updated: 2023-11-28 02:52:38

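I have a simple HTML table to parse, but for some reason BeautifulSoup only gives me results from the last row. I'm wondering if someone could take a look and see what's wrong. This is the HTML table I am building the row objects from: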

 <table class='participants-table'>
<thead>
<tr>
<th data-field="name" class="sort-direction-toggle name">Name</th>
<th data-field="type" class="sort-direction-toggle type active-sort asc">Type</th>
<th data-field="sector" class="sort-direction-toggle sector">Sector</th>
<th data-field="country" class="sort-direction-toggle country">Country</th>
<th data-field="joined_on" class="sort-direction-toggle joined-on">Joined On</th>
</tr>
</thead>
<tbody>
<tr>
<th class='name'><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
<td class='type'>Company</td>
<td class='sector'>General Industrials</td>
<td class='country'>Netherlands</td>
<td class='joined-on'>2000-09-20</td>
</tr>
<tr>
<th class='name'><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
<td class='type'>Company</td>
<td class='sector'>Pharmaceuticals &amp; Biotechnology</td>
<td class='country'>Portugal</td>
<td class='joined-on'>2004-02-19</td>
</tr>
</tbody>
</table>

I get the rows with the following code:

table=soup.find_all("table", class_="participants-table")
table1=table[0]
rows=table1.find_all('tr')
rows=rows[1:]

This gives:

rows=[<tr>
<th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
<td class="type">Company</td>
<td class="sector">General Industrials</td>
<td class="country">Netherlands</td>
<td class="joined-on">2000-09-20</td>
</tr>, <tr>
<th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
<td class="type">Company</td>
<td class="sector">Pharmaceuticals &amp; Biotechnology</td>
<td class="country">Portugal</td>
<td class="joined-on">2004-02-19</td>
</tr>]

That looks as expected. However, if I continue with:

for row in rows:
    cells = row.find_all('th')

I only get the last entry!

cells=[<th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>]

What's going on? This is my first time using BeautifulSoup, and what I want to do is export this table to CSV. Any help is greatly appreciated! Thanks

Best answer

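If you want all the th tags in a single list, you need to extend that list. As written, the loop keeps reassigning cells = row.find_all('th'), so when you print cells after the loop you only see whatever was assigned last, i.e. the th from the last tr: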

cells = []
for row in rows:
    cells.extend(row.find_all('th'))
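To make the difference concrete, here is a minimal sketch that runs both loops side by side. The trimmed-down `html` string and the names `last_only`, `all_cells`, and `names` are assumptions made for illustration; they are not part of the original question.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (assumed for illustration).
html = """
<table class="participants-table">
  <thead><tr><th>Name</th><th>Type</th></tr></thead>
  <tbody>
    <tr><th class="name">Grontmij</th><td>Company</td></tr>
    <tr><th class="name">Groupe Bial</th><td>Company</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", class_="participants-table").find_all("tr")[1:]

# Reassignment: every pass overwrites the value from the previous pass.
for row in rows:
    last_only = row.find_all("th")

# Accumulation: every pass appends to the same list.
all_cells = []
for row in rows:
    all_cells.extend(row.find_all("th"))

names = [th.get_text(strip=True) for th in all_cells]
print(len(last_only), names)
```

After the first loop, `last_only` holds only the cells of the final row, while `all_cells` holds one th per data row.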

Also, since there is only one table, you can use find:

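from bs4 import BeautifulSoup

# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", class_="participants-table")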
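As a rough illustration of how the two calls differ: find_all returns a list-like collection of every match, while find returns the first match directly, or None when nothing matches. The one-line `html` string and the class name `no-such-class` below are made up for this sketch.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (assumed for illustration).
html = '<table class="participants-table"><tr><td>x</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all("table", class_="participants-table")  # list-like, may be empty
first = soup.find("table", class_="participants-table")        # a Tag, or None
missing = soup.find("table", class_="no-such-class")           # nothing matches here

# Guard against None before chaining further calls on the result.
row_count = len(first.find_all("tr")) if first is not None else 0
print(len(matches), row_count, missing is None)
```

Because find can return None, checking the result before calling find_all on it avoids an AttributeError on pages where the table is absent.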

If you want to skip the thead row, you can use a CSS selector:

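from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# The data rows sit inside <tbody> in the sample markup, so select them there.
# "thead ~ tr" would only match <tr> elements that are siblings of <thead>.
rows = soup.select("table.participants-table tbody tr")

cells = [tr.th for tr in rows]
print(cells)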

cells gives you:

[<th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>, <th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>]
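Since the goal is a CSV export, the text and link inside each th are usually more useful than the tag objects themselves. The following is a sketch only: the shortened `html` string and the name `pairs` are assumptions, and the rows are selected through tbody because that is where they sit in the sample markup.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (assumed for illustration).
html = """
<table class="participants-table">
  <thead><tr><th>Name</th></tr></thead>
  <tbody>
    <tr><th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th></tr>
    <tr><th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.participants-table tbody tr")

# Pull the visible text and the link target out of each row's first <th>.
pairs = [(tr.th.get_text(strip=True), tr.th.a["href"]) for tr in rows]
for name, href in pairs:
    print(name, href)
```

Note that `tr.th.a["href"]` raises an error if a row has no link, so real pages with irregular rows would need a check first.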

To write the whole table to CSV:

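import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

rows = soup.select("table.participants-table tr")

# newline="" lets the csv module handle line endings itself
with open("data.csv", "w", newline="") as out:
    wr = csv.writer(out)
    # header row, plus an extra column for the link
    wr.writerow([th.text for th in rows[0].find_all("th")] + ["URL"])

    for row in rows[1:]:
        # only the th/td cells; a bare find_all() would also return the nested <a>
        wr.writerow([tag.text for tag in row.find_all(["th", "td"])] + [row.th.a["href"]])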

For your sample, this gives:

Name,Type,Sector,Country,Joined On,URL
Grontmij,Company,General Industrials,Netherlands,2000-09-20,/what-is-gc/participants/4479-Grontmij
Groupe Bial,Company,Pharmaceuticals & Biotechnology,Portugal,2004-02-19,/what-is-gc/participants/4492-Groupe-Bial
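One way to sanity-check the result is to read it back with csv.DictReader. The sketch below parses the sample output from an in-memory string instead of data.csv so that it runs on its own. The URL column holds relative paths, and the base address `https://example.com/` used to expand them is purely hypothetical, since the real site is not given in the question.

```python
import csv
import io
from urllib.parse import urljoin

# The sample output shown above, held in memory for illustration.
sample = """Name,Type,Sector,Country,Joined On,URL
Grontmij,Company,General Industrials,Netherlands,2000-09-20,/what-is-gc/participants/4479-Grontmij
Groupe Bial,Company,Pharmaceuticals & Biotechnology,Portugal,2004-02-19,/what-is-gc/participants/4492-Groupe-Bial
"""

records = list(csv.DictReader(io.StringIO(sample)))
countries = [r["Country"] for r in records]

# Hypothetical base URL; replace it with the address of the page that was scraped.
base = "https://example.com/"
full_urls = [urljoin(base, r["URL"]) for r in records]

print(countries)
print(full_urls[0])
```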

Regarding "python - BeautifulSoup HTML table parsing: only able to get the last row?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38753246/
