
python - Web scraping a table returns data that looks wrong but is actually correct

Reposted. Author: 太空宇宙. Last updated: 2023-11-03 11:38:18

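I am trying to scrape this table from ESPN's New York Knicks 2019 schedule, but the data shown on the website differs from the data that actually gets scraped.

After making sure I was doing it correctly and checking other sites for the actual dates, it seems the data I am scraping is correct, but the values displayed on ESPN are wrong!?

Here is my code: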

import re

import bs4
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'}
url = "http://www.espn.com/nba/team/schedule/_/name/ny/season/2019"
req = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(req.text, "html.parser")
table = soup.find("tbody", {"class": "Table2__tbody"})

# Collect the date from every row whose text has "vs" right after the date.
dates = []
for row in table.find_all("tr"):
    match = re.search(r'(\w\w\w, \w\w\w \d+)vs.*', row.text)  # date=Wed, Oct 17 from where??
    # print(row.text)
    if match:  # rows that do not match are skipped
        dates.append(match.group(1))

# Collect every link in the table cells that points at a game page.
links = []
for cell in table.find_all(class_="Table2__td"):
    anchor = cell.find("a", href=True)
    # print(anchor['href'])
    if anchor and "gameId" in anchor['href']:
        links.append(anchor['href'])

print(dates)
dictionary = dict(zip(dates, links))
print(dictionary)

Output:

['Wed, Oct 17', 'Sat, Oct 20', 'Fri, Oct 26', 'Mon, Oct 29', 'Wed, Oct 31', 'Mon, Nov 5', 'Sun, Nov 11', 'Tue, Nov 20', 'Fri, Nov 23', 'Sat, Dec 1', 'Mon, Dec 3', 'Sat, Dec 8', 'Sun, Dec 9', 'Mon, Dec 17', 'Fri, Dec 21', 'Tue, Dec 25', 'Fri, Jan 11', 'Sun, Jan 13', 'Thu, Jan 17', 'Mon, Jan 21', 'Wed, Jan 23', 'Sun, Jan 27', 'Wed, Jan 30', 'Fri, Feb 1', 'Sun, Feb 3', 'Tue, Feb 5', 'Sat, Feb 9', 'Wed, Feb 13', 'Fri, Feb 22', 'Sun, Feb 24', 'Tue, Feb 26', 'Thu, Feb 28', 'Sat, Mar 9', 'Sun, Mar 17', 'Wed, Mar 20', 'Fri, Mar 22', 'Sun, Mar 24', 'Thu, Mar 28', 'Sat, Mar 30', 'Mon, Apr 1', 'Sun, Apr 7', 'Wed, Apr 10']
{'Wed, Oct 17': 'http://www.espn.com/nba/game?gameId=401070697', 'Sat, Oct 20': 'http://www.espn.com/nba/game?gameId=401070704', 'Fri, Oct 26': 'http://www.espn.com/nba/game?gameId=401070711', 'Mon, Oct 29': 'http://www.espn.com/nba/game?gameId=401070723', 'Wed, Oct 31': 'http://www.espn.com/nba/game?gameId=401070735', 'Mon, Nov 5': 'http://www.espn.com/nba/game?gameId=401070749', 'Sun, Nov 11': 'http://www.espn.com/nba/game?gameId=401070771', 'Tue, Nov 20': 'http://www.espn.com/nba/game?gameId=401070786', 'Fri, Nov 23': 'http://www.espn.com/nba/game?gameId=401070802', 'Sat, Dec 1': 'http://www.espn.com/nba/game?gameId=401070816', 'Mon, Dec 3': 'http://www.espn.com/nba/game?gameId=401070824', 'Sat, Dec 8': 'http://www.espn.com/nba/game?gameId=401070836', 'Sun, Dec 9': 'http://www.espn.com/nba/game?gameId=401070855', 'Mon, Dec 17': 'http://www.espn.com/nba/game?gameId=401070867', 'Fri, Dec 21': 'http://www.espn.com/nba/game?gameId=401070890', 'Tue, Dec 25': 'http://www.espn.com/nba/game?gameId=401070903', 'Fri, Jan 11': 'http://www.espn.com/nba/game?gameId=401070917', 'Sun, Jan 13': 'http://www.espn.com/nba/game?gameId=401070932', 'Thu, Jan 17': 'http://www.espn.com/nba/game?gameId=401070936', 'Mon, Jan 21': 'http://www.espn.com/nba/game?gameId=401070950', 'Wed, Jan 23': 'http://www.espn.com/nba/game?gameId=401070972', 'Sun, Jan 27': 'http://www.espn.com/nba/game?gameId=401070982', 'Wed, Jan 30': 'http://www.espn.com/nba/game?gameId=401070988', 'Fri, Feb 1': 'http://www.espn.com/nba/game?gameId=401071011', 'Sun, Feb 3': 'http://www.espn.com/nba/game?gameId=401071027', 'Tue, Feb 5': 'http://www.espn.com/nba/game?gameId=401071046', 'Sat, Feb 9': 'http://www.espn.com/nba/game?gameId=401071063', 'Wed, Feb 13': 'http://www.espn.com/nba/game?gameId=401071071', 'Fri, Feb 22': 'http://www.espn.com/nba/game?gameId=401071087', 'Sun, Feb 24': 'http://www.espn.com/nba/game?gameId=401071102', 'Tue, Feb 26': 'http://www.espn.com/nba/game?gameId=401071119', 'Thu, Feb 28': 
'http://www.espn.com/nba/game?gameId=401071125', 'Sat, Mar 9': 'http://www.espn.com/nba/game?gameId=401071138', 'Sun, Mar 17': 'http://www.espn.com/nba/game?gameId=401071153', 'Wed, Mar 20': 'http://www.espn.com/nba/game?gameId=401070233', 'Fri, Mar 22': 'http://www.espn.com/nba/game?gameId=401071189', 'Sun, Mar 24': 'http://www.espn.com/nba/game?gameId=401071208', 'Thu, Mar 28': 'http://www.espn.com/nba/game?gameId=401071227', 'Sat, Mar 30': 'http://www.espn.com/nba/game?gameId=401071250', 'Mon, Apr 1': 'http://www.espn.com/nba/game?gameId=401071273', 'Sun, Apr 7': 'http://www.espn.com/nba/game?gameId=401071281', 'Wed, Apr 10': 'http://www.espn.com/nba/game?gameId=401071299'}

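Best answer

This comes down to which calendar day the game fell on in a given time zone. The site appears to render dates relative to the time zone the request is made from, which, judging by your comments, is somewhere in the US or in Japan. The large offset between those two can cause different dates to be displayed.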

For example, the first game took place at 7:30 PM on October 17, 2018 in New York, which is at UTC-4. That same moment equates to 8:30 AM on October 18, 2018 in Tokyo, which is at UTC+9.
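The conversion can be checked with a few lines of Python. This is only a minimal sketch: the fixed offsets are the UTC-4 and UTC+9 values quoted above, the tip-off time is the one stated in this answer, and `schedule_label` is a hypothetical helper that mimics the format of the schedule column. It says nothing about how ESPN itself picks the time zone.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets as quoted in the answer (an assumption for this sketch);
# real code would more likely use zoneinfo.ZoneInfo("America/New_York")
# so that daylight saving time is handled automatically.
NEW_YORK = timezone(timedelta(hours=-4), "UTC-4")
TOKYO = timezone(timedelta(hours=9), "UTC+9")

# Tip-off of the first game: October 17, 2018, 7:30 PM New York time.
tipoff_ny = datetime(2018, 10, 17, 19, 30, tzinfo=NEW_YORK)

# The same instant, read off a Tokyo wall clock.
tipoff_tokyo = tipoff_ny.astimezone(TOKYO)


def schedule_label(moment):
    """Hypothetical helper: format a datetime like 'Wed, Oct 17'."""
    return f"{moment:%a, %b} {moment.day}"


print(schedule_label(tipoff_ny))     # Wed, Oct 17
print(schedule_label(tipoff_tokyo))  # Thu, Oct 18
print(tipoff_ny == tipoff_tokyo)     # True: one instant, two calendar dates
```

Both labels describe the same game, so a date string scraped without its time zone is ambiguous. Comparing schedules is safer when each tip-off is kept as a time-zone-aware value.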

Regarding "python - Web scraping a table returns data that looks wrong but is actually correct", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54971516/
