python - How do I get all tr elements from a table and click a link?

Reposted · Author: 行者123 · Updated: 2023-12-01 09:06:53

I'm trying to figure out how to print all the tr elements from a table, but I can't get it working properly.

Here is the link I'm working with.

https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate

Here is my code.

import requests
from bs4 import BeautifulSoup

link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"

html = requests.get(link).text

# If you do not want to use requests, you can fetch the page
# with urllib instead; it should not cause any issue.
soup = BeautifulSoup(html, "lxml")
res = soup.findAll("span", {"class": "fn"})
for r in res:
    print("Name: " + r.find('a').text)

table_body = soup.find('senators')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)

I'm trying to print all the tr elements from the table named 'senators'. Also, I'd like to know if there is a way to click a senator's link, for example 'Richard Shelby', which would take me here:

https://en.wikipedia.org/wiki/Richard_Shelby

From each link, I want to get the data under 'Assumed office'. In this case, that value is 'January 3, 2018'. So, ultimately, I want to end up with something like this:

Richard Shelby  May 6, 1934 (age 84)    Lawyer  U.S. House
Alabama Senate January 3, 1987 2022
Assumed office: January 3, 2018

All I can get right now is each senator's name printed out.

Best answer

To locate the "Senators" table, you can first find the corresponding "Senators" label and then get the first following table element:

soup.find(id='Senators').find_next("table")
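For illustration, here is a minimal, self-contained sketch of that lookup against a toy HTML fragment (the markup below is invented; the real Wikipedia page is far larger, but the anchor-then-find_next pattern is the same):

```python
from bs4 import BeautifulSoup

# Toy stand-in for the Wikipedia page: a heading with id="Senators"
# followed by the table we want.
html = """
<h2><span id="Senators">Senators</span></h2>
<table>
  <tr><th>Name</th></tr>
  <tr><td>Richard Shelby</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# The span itself contains no table; find_next() walks forward in
# document order and returns the first <table> after the anchor.
table = soup.find(id="Senators").find_next("table")
print(table.find("td").get_text())  # → Richard Shelby
```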

Now, to collect the data row by row, you have to account for cells that have a "rowspan" spanning multiple rows. You can follow the approach suggested in "What should I do when <tr> has rowspan", or use the implementation I provide below (not ideal, but it works in your case).

import copy

import requests
from bs4 import BeautifulSoup


link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"


with requests.Session() as session:
    html = session.get(link).text

    soup = BeautifulSoup(html, "lxml")
    senators_table = soup.find(id='Senators').find_next("table")

    headers = [td.get_text(strip=True) for td in senators_table.tr('th')]

    rows = senators_table.find_all('tr')

    # pre-process table to account for rowspan, TODO: extract into a function
    for row_index, tr in enumerate(rows):
        for cell_index, td in enumerate(tr('td')):
            if 'rowspan' in td.attrs:
                rowspan = int(td['rowspan'])

                del td.attrs['rowspan']

                # insert same td into subsequent rows
                for index in range(row_index + 1, row_index + rowspan):
                    try:
                        rows[index]('td')[cell_index].insert_after(copy.copy(td))
                    except IndexError:
                        continue

    # extracting the desired data
    rows = senators_table.find_all('tr')[1:]
    for row in rows:
        cells = [td.get_text(strip=True) for td in row('td')]
        print(dict(zip(headers, cells)))

If you want to follow the links to the senators' "profile" pages, you first need to extract the link from the corresponding cell in a row, and then use session.get() to "navigate" to it, along these lines:

senator_link = row.find_all('td')[3].a['href']
senator_link = urljoin(link, senator_link)
response = session.get(senator_link)

soup = BeautifulSoup(response.content, "lxml")

# TODO: parse

where urljoin is imported as:

from urllib.parse import urljoin
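As for the "# TODO: parse" step, the 'Assumed office' value could be pulled out of the profile page's infobox roughly like this. This is a hedged sketch: the infobox HTML below is a simplified, invented stand-in, and Wikipedia's real markup is more complex and may change over time:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a senator's infobox; the real markup differs.
profile_html = """
<table class="infobox">
  <tr><th>Assumed office<br/>January 3, 2018</th></tr>
</table>
"""

profile = BeautifulSoup(profile_html, "html.parser")
# Find the text node "Assumed office", then take the next text node
# after it (the date that follows the <br/>).
label = profile.find(string="Assumed office")
assumed_office = label.find_next(string=True).strip()
print("Assumed office:", assumed_office)  # → Assumed office: January 3, 2018
```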

Also, just FYI, one of the reasons to use requests.Session() here is to optimize making requests to the same host:

The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3’s connection pooling. So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase


There is also another way to parse tabular data: .read_html() from pandas. You could do:

import pandas as pd

df = pd.read_html(str(senators_table))[0]
print(df.head())

to get the desired table as a dataframe.
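As a quick self-contained check of the same approach (using an invented toy table so that no network request is needed), read_html() can parse HTML passed via a file-like object:

```python
import io

import pandas as pd

# Invented miniature table; the real senators table has many more columns.
toy_html = """
<table>
  <tr><th>Senator</th><th>Assumed office</th></tr>
  <tr><td>Richard Shelby</td><td>January 3, 2018</td></tr>
</table>
"""

# read_html() returns a list of dataframes, one per table found.
df = pd.read_html(io.StringIO(toy_html))[0]
print(df.loc[0, "Senator"])  # → Richard Shelby
```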

Regarding "python - How do I get all tr elements from a table and click a link?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51970446/
