
python - Scraping data for each subpage with BeautifulSoup - the URLs are long and differently formatted


I'm scraping NFL passing data from 1971 to 2019. I was able to scrape data from the first page of each year with the following code:

# This code works:
import requests
from bs4 import BeautifulSoup as bsoup

passingData = []  # create empty list to store column data
for year in range(1971, 2020):
    url = 'https://www.nfl.com/stats/player-stats/category/passing/%s/REG/all/passingyards/desc' % (year)
    response = requests.get(url)
    response = response.content
    parsed_html = bsoup(response, 'html.parser')
    data_rows = parsed_html.find_all('tr')
    passingData.append([[col.text.strip() for col in row.find_all('td')] for row in data_rows])
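
For reference, each loop iteration above appends one nested list per year (one inner list per table row; the header row shows up as an empty list because it has no td cells). A minimal sketch of flattening that into one row per player, assuming the variables from the snippet above (the name flat_rows is mine):

# Hypothetical follow-up: flatten passingData into one row per player,
# dropping the empty lists produced by header rows (which have no <td> cells).
flat_rows = []
for year_rows in passingData:   # one entry per season
    for row in year_rows:       # one entry per <tr>
        if row:                 # skip header rows (empty lists)
            flat_rows.append(row)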

Each year's first page only shows 25 players, and roughly 70-90 players attempted a pass in any given year (so there are 3-4 pages of player data in each year's "subpages"). The problem comes when I try to scrape those subpages. I tried adding another sub-loop that pulls the href of each link to the next page, found in the div with class 'nfl-o-table-pagination__buttons'.

Unfortunately, I couldn't get anything beyond the first page appended to the passingData list. I tried the following, but got an "index out of range" error on the subUrl line.

I'm still new to web scraping, so let me know if my logic is off. I figured I could just append the subpage data (since the table structure is the same), but the error seems to appear when I try to go from:
https://www.nfl.com/stats/player-stats/category/passing/%s/REG/all/passingyards/desc
to the second page, whose URL is:
https://www.nfl.com/stats/player-stats/category/passing/2019/REG/all/passingYards/DESC?aftercursor=0000001900000000008500100079000840a7a000000000006e00000005000000045f74626c00000010706572736f6e5f7465616d5f737461740000000565736249640000000944415234363631343100000004726f6c6500000003504c5900000008736561736f6e496400000004323031390000000a736561736f6e5479706500000003524547f07fffffe6f07fffffe6389bd3f93412939a78c1e6950d620d060004

for subPage in range(1971, 2020):
    subPassingData = []
    subUrl = soup.select('.nfl-o-table-pagination__buttons a')[0]['href']
    new = requests.get(f"{url}{subUrl}")
    newResponse = new.content
    soup1 = bsoup(new.text, 'html.parser')
    sub_data_rows = soup1.find_all('tr')
    subPassingData.append([[col.text.strip() for col in row.find_all('td')] for row in data_rows])

passingData.append(subPassingData)

Thanks for your help.

Best answer

This script works for all selected years and their subpages, and loads the data into a dataframe (or you can save it to csv, etc.):

import requests
from bs4 import BeautifulSoup

url = 'https://www.nfl.com/stats/player-stats/category/passing/{year}/REG/all/passingyards/desc'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

all_data = []

for year in range(2017, 2020):  # <-- change to desired years
    soup = BeautifulSoup(requests.get(url.format(year=year), headers=headers).content, 'html.parser')
    page = 1

    while True:
        print('Page {}/{}...'.format(page, year))

        # collect every data row (rows that contain <td> cells) on the current page
        for tr in soup.select('tr:has(td)'):
            tds = [year] + [td.get_text(strip=True) for td in tr.select('td')]
            all_data.append(tds)

        # follow the "next page" link until there isn't one
        next_url = soup.select_one('.nfl-o-table-pagination__next')
        if not next_url:
            break

        u = 'https://www.nfl.com' + next_url['href']
        soup = BeautifulSoup(requests.get(u, headers=headers).content, 'html.parser')
        page += 1


# here we create a dataframe from the list `all_data` and print it to screen:
import pandas as pd
df = pd.DataFrame(all_data)
print(df)

Prints:

Page 1/2017...
Page 2/2017...
Page 3/2017...
Page 4/2017...
Page 1/2018...
Page 2/2018...
Page 3/2018...
Page 4/2018...
Page 1/2019...
Page 2/2019...
Page 3/2019...
Page 4/2019...
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 2017 Tom Brady 4577 7.9 581 385 0.663 32 8 102.8 230 0.396 62 10 64 35 201
1 2017 Philip Rivers 4515 7.9 575 360 0.626 28 10 96 216 0.376 61 12 75 18 120
2 2017 Matthew Stafford 4446 7.9 565 371 0.657 29 10 99.3 209 0.37 61 16 71 47 287
3 2017 Drew Brees 4334 8.1 536 386 0.72 23 8 103.9 201 0.375 72 11 54 20 145
4 2017 Ben Roethlisberger 4251 7.6 561 360 0.642 28 14 93.4 207 0.369 52 14 97 21 139
.. ... ... ... ... ... ... ... .. .. ... ... ... .. .. .. .. ...
256 2019 Trevor Siemian 3 0.5 6 3 0.5 0 0 56.3 0 0 0 0 3 2 17
257 2019 Blake Bortles 3 1.5 2 1 0.5 0 0 56.3 0 0 0 0 3 0 0
258 2019 Kenjon Barner 3 3 1 1 1 0 0 79.2 0 0 0 0 3 0 0
259 2019 Alex Tanney 1 1 1 1 1 0 0 79.2 0 0 0 0 1 0 0
260 2019 Matt Haack 1 1 1 1 1 1 0 118.8 1 1 0 0 1 0 0

[261 rows x 17 columns]
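
If you'd rather keep the results as a csv, as mentioned above, a minimal follow-up sketch (the column labels and file name are placeholders, not part of the original answer):

# Hypothetical follow-up: label the first two columns and write the frame to disk.
df = df.rename(columns={0: 'season', 1: 'player'})
df.to_csv('nfl_passing.csv', index=False)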

Regarding "python - Scraping data for each subpage with BeautifulSoup - the URLs are long and differently formatted", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62645312/
