
python - Scraping a table (several pages) into a Pandas DataFrame

Reposted. Author: 行者123. Updated: 2023-12-04 07:59:00

I am trying to load a long table (24 pages) into a Pandas DataFrame, but I think there is a problem with my for-loop code.

import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://scrapethissite.com/pages/forms/?page_num={}'
res = requests.get(base_url.format('1'))
soup = BeautifulSoup(res.text, 'lxml')

table = soup.select('table.table')[0]
columns = table.find('tr').find_all('th')
columns_names = [str(c.get_text()).strip() for c in columns]
table_rows = table.find_all('tr', class_='team')

l = []
for n in range(1, 25):
    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)
    soup = BeautifulSoup(res.text, 'lxml')
    for tr in table_rows:
        td = tr.find_all('td')
        row = [str(tr.get_text()).strip() for tr in td]
        l.append(row)

df = pd.DataFrame(l, columns=columns_names)
The resulting DataFrame is just the first page repeated over and over, instead of a copy of the data from all pages.

Best Answer

I agree with @mxbi. In your code, `table_rows` is built once from page 1 before the loop, so every iteration appends the same page-1 rows again. Fetch and parse the table inside the loop instead.
Try this:

import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://scrapethissite.com/pages/forms/?page_num={}'

l = []
for n in range(1, 25):
    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)
    soup = BeautifulSoup(res.text, 'lxml')

    table = soup.select('table.table')[0]
    columns = table.find('tr').find_all('th')
    columns_names = [str(c.get_text()).strip() for c in columns]
    table_rows = table.find_all('tr', class_='team')

    for tr in table_rows:
        td = tr.find_all('td')
        row = [str(tr.get_text()).strip() for tr in td]
        l.append(row)

df = pd.DataFrame(l, columns=columns_names)
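As a side note (not part of the original answer): pandas can also parse HTML tables directly with `pandas.read_html`, which returns one DataFrame per `<table>` in the document, so the per-row BeautifulSoup extraction can be skipped. A minimal sketch, using two inline HTML strings as stand-ins for the 24 fetched pages (with `requests` you would pass `StringIO(res.text)` for each page instead):

```python
from io import StringIO

import pandas as pd

# Two tiny HTML documents standing in for the paginated responses.
# The column names and team rows are illustrative, mirroring the
# structure of the scraped table (a <th> header row plus data rows).
pages = [
    "<table class='table'>"
    "<tr><th>Team Name</th><th>Wins</th></tr>"
    "<tr class='team'><td>Boston Bruins</td><td>44</td></tr>"
    "</table>",
    "<table class='table'>"
    "<tr><th>Team Name</th><th>Wins</th></tr>"
    "<tr class='team'><td>Buffalo Sabres</td><td>31</td></tr>"
    "</table>",
]

# read_html returns a list of DataFrames, one per <table>; here each
# page contains exactly one table, so we take element [0].
frames = [pd.read_html(StringIO(html))[0] for html in pages]

# Stack the per-page frames into a single DataFrame.
df = pd.concat(frames, ignore_index=True)
print(df)
```

Note that `read_html` ignores row classes like `class="team"`, so if the real pages contained extra non-data rows they would need to be filtered out afterwards.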

Regarding python - Scraping a table (several pages) into a Pandas DataFrame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66554530/
