python - Problem parsing NBA boxscore data with BeautifulSoup

Reposted. Author: 行者123. Last updated: 2023-11-30 23:15:04

I am trying to parse player-level NBA boxscore data from ESPN. Here is the initial part of my attempt:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date

request = requests.get('http://espn.go.com/nba/boxscore?gameId=400277722')
soup = BeautifulSoup(request.text,'html.parser')
table = soup.find_all('table')

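BeautifulSoup seems to be giving me a strange result. The last 'table' in the source code contains the player data, which is what I want to extract. Viewing the source online shows that this table closes at line 421, after the boxscores of both teams. However, if we look at the 'soup', an extra line that closes the table has been added before the Miami stats. This happens at line 350 of the online source.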
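One way to check whether the early close is present in the downloaded HTML itself or is introduced by the parser is to log where every `<table>` and `</table>` tag sits in the raw text. The sketch below uses only the standard library's `html.parser.HTMLParser`. The names `TableTagLogger` and `table_events` and the sample markup are illustrative assumptions, not part of the original code.

```python
from html.parser import HTMLParser


class TableTagLogger(HTMLParser):
    """Record (line, event, depth) for every <table> and </table> tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of tables currently open
        self.events = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            self.events.append((self.getpos()[0], "open", self.depth))

    def handle_endtag(self, tag):
        if tag == "table":
            # Depth is recorded before decrementing; 0 means nothing was open.
            self.events.append((self.getpos()[0], "close", self.depth))
            self.depth -= 1


def table_events(html_text):
    """Return the list of table open/close events found in html_text."""
    logger = TableTagLogger()
    logger.feed(html_text)
    logger.close()
    return logger.events


# Hypothetical, hand-written markup with a stray </table> before the second team.
SAMPLE = """<table>
<tr><td>Boston Celtics</td></tr>
</table>
<tr><td>Miami Heat</td></tr>
</table>"""

for line, event, depth in table_events(SAMPLE):
    print(line, event, depth)
```

A close event recorded at depth 0 marks a `</table>` with no open table. To apply this to the real page, pass `request.text` to `table_events`; the line numbers then refer to the HTML that `requests` downloaded, which may not match what the browser's view-source shows.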

The output with the 'html.parser' parser is:

Game 1: Tuesday, October 30thCeltics107FinalHeat120Recap »Boxscore »
Game 2: Sunday, January 27thHeat98Final2OTCeltics100Recap »Boxscore »
Game 3: Monday, March 18thHeat105FinalCeltics103Recap »Boxscore »
Game 4: Friday, April 12thCeltics101FinalHeat109Recap »Boxscore »

1 2 3 4 T

BOS 25 29 22 31107MIA 31 31 31 27120

Boston Celtics
STARTERS
MIN
FGM-A
3PM-A
FTM-A
OREB
DREB
REB
AST
STL
BLK
TO
PF
+/-
PTS

Kevin Garnett, PF324-80-01-11111220254-49
Brandon Bass, PF286-110-03-4651110012-815
Paul Pierce, SF416-152-49-905552003-1723
Rajon Rondo, PG449-140-22-4077130044-1320
Courtney Lee, SG245-61-10-001110015-711
BENCH
MIN
FGM-A
3PM-A
FTM-A
OREB
DREB
REB
AST
STL
BLK
TO
PF
+/-
PTS

Jared Sullinger, PF81-20-00-001100001-32
Jeff Green, SF230-40-03-403301010-73
Jason Terry, SG252-70-34-400011033-108
Leandro Barbosa, SG166-83-31-201110001+416
Chris Wilcox, PFDNP COACH'S DECISION
Kris Joseph, SFDNP COACH'S DECISION
Jason Collins, CDNP COACH'S DECISION
Darko Milicic, CDNP COACH'S DECISIONTOTALS
FGM-A
3PM-A
FTM-A
OREB

As you can see, it ends mid-line at 'OREB' and never reaches the Miami Heat section. The output with the 'lxml' parser is:

Game 1: Tuesday, October 30thCeltics107FinalHeat120Recap »Boxscore »
Game 2: Sunday, January 27thHeat98Final2OTCeltics100Recap »Boxscore »
Game 3: Monday, March 18thHeat105FinalCeltics103Recap »Boxscore »
Game 4: Friday, April 12thCeltics101FinalHeat109Recap »Boxscore »

1 2 3 4T

BOS 25 29 22 31107MIA 31 31 31 27120

This does not include the boxscore at all. The full code I am using (courtesy of Daniel Rodriguez) is as follows:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date

games = pd.read_csv('games_13.csv').set_index('id')
BASE_URL = 'http://espn.go.com/nba/boxscore?gameId={0}'

request = requests.get(BASE_URL.format(games.index[0]))
table = BeautifulSoup(request.text,'html.parser').find('table', class_='mod-data')
heads = table.find_all('thead')
headers = heads[0].find_all('tr')[1].find_all('th')[1:]
headers = [th.text for th in headers]
columns = ['id', 'team', 'player'] + headers

players = pd.DataFrame(columns=columns)

def get_players(players, team_name):
    array = np.zeros((len(players), len(headers)+1), dtype=object)
    array[:] = np.nan
    for i, player in enumerate(players):
        cols = player.find_all('td')
        array[i, 0] = cols[0].text.split(',')[0]
        for j in range(1, len(headers) + 1):
            if not cols[1].text.startswith('DNP'):
                array[i, j] = cols[j].text

    frame = pd.DataFrame(columns=columns)
    for x in array:
        line = np.concatenate(([index, team_name], x)).reshape(1,len(columns))
        new = pd.DataFrame(line, columns=frame.columns)
        frame = frame.append(new)
    return frame

for index, row in games.iterrows():
    print(index)
    request = requests.get(BASE_URL.format(index))
    table = BeautifulSoup(request.text, 'html.parser').find('table', class_='mod-data')
    heads = table.find_all('thead')
    bodies = table.find_all('tbody')

    team_1 = heads[0].th.text
    team_1_players = bodies[0].find_all('tr') + bodies[1].find_all('tr')
    team_1_players = get_players(team_1_players, team_1)
    players = players.append(team_1_players)

    team_2 = heads[3].th.text
    team_2_players = bodies[3].find_all('tr') + bodies[4].find_all('tr')
    team_2_players = get_players(team_2_players, team_2)
    players = players.append(team_2_players)

players = players.set_index('id')
print(players)
players.to_csv('players_13.csv')

An example of the output I want is:

,id,team,player,MIN,FGM-A,3PM-A,FTM-A,OREB,DREB,REB,AST,STL,BLK,TO,PF,+/-,PTS
0,400277722,Boston Celtics,Brandon Bass,28,6-11,0-0,3-4,6,5,11,1,0,0,1,2,-8,15
0,400277722,Boston Celtics,Paul Pierce,41,6-15,2-4,9-9,0,5,5,5,2,0,0,3,-17,23
...
0,400277722,Miami Heat,Shane Battier,29,2-4,2-3,0-0,0,2,2,1,1,0,0,3,+12,6
0,400277722,Miami Heat,LeBron James,29,10-16,2-4,4-5,1,9,10,3,2,0,0,2,+12,26

Best answer

BeautifulSoup truncated part of the results for me as well, so I replaced the soup.find_all calls with re.findall:

# Python 2. Assumes BeautifulSoup is imported as in the question.
# `br` is not defined in the answer; it is presumably a browser object
# such as a mechanize.Browser() instance.
import re

r = br.open('http://espn.go.com/nba/boxscore?gameId=400277722')
html = r.read()
soup = BeautifulSoup(html)

statnames = re.search('STARTERS</th>.*?PTS</th>',html, re.DOTALL).group()
th = re.findall('th.*</th', statnames) # each th tag contains a statname
names = ['Name', 'Team']
for t in th:
    t = re.sub('.*>','',t)
    t = t.replace('</th','')
    names.append(t)
print names

celts = re.search('Boston Celtics.*?Total Team Turnovers',html,re.DOTALL).group()
heat = re.search('nba-small-mia floatleft.*?Total Team Turnovers',html,re.DOTALL).group()

players = str(soup).split('td nowrap')
for player in players[1:len(players)]:
    try:
        stats = [re.search('[A-Z]?[a-z]?[A-Z][a-z]{1,} [A-Z][a-z]{1,}',player).group()]
    except:
        stats = [re.search('[A-Z]\.?[A-Z]?\.? [A-Z][a-z]{1,}',player).group()] # player name
    if stats[0] in celts:
        stats.append('Boston Celtics')
    elif stats[0] in heat:
        stats.append('Miami Heat')
    td = re.findall('td.*?/td', player) # each td tag contains a stat
    for t in td:
        t = re.findall('>.*<',t)
        t = re.sub('.*>','',t[0])
        t = t.replace('<','')
        if t!='' and t!='\xc2\xa0':
            stats.append(t)
    print stats

Output:

['Name', 'Team', 'MIN', 'FGM-A', '3PM-A', 'FTM-A', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PF', '+/-', 'PTS']
['Kevin Garnett', 'Boston Celtics', '32', '4-8', '0-0', '1-1', '1', '11', '12', '2', '0', '2', '5', '4', '-4', '9']
['Brandon Bass', 'Boston Celtics', '28', '6-11', '0-0', '3-4', '6', '5', '11', '1', '0', '0', '1', '2', '-8', '15']
['Paul Pierce', 'Boston Celtics', '41', '6-15', '2-4', '9-9', '0', '5', '5', '5', '2', '0', '0', '3', '-17', '23']
['Rajon Rondo', 'Boston Celtics', '44', '9-14', '0-2', '2-4', '0', '7', '7', '13', '0', '0', '4', '4', '-13', '20']
['Courtney Lee', 'Boston Celtics', '24', '5-6', '1-1', '0-0', '0', '1', '1', '1', '0', '0', '1', '5', '-7', '11']
['Jared Sullinger', 'Boston Celtics', '8', '1-2', '0-0', '0-0', '0', '1', '1', '0', '0', '0', '0', '1', '-3', '2']
['Jeff Green', 'Boston Celtics', '23', '0-4', '0-0', '3-4', '0', '3', '3', '0', '1', '0', '1', '0', '-7', '3']
['Jason Terry', 'Boston Celtics', '25', '2-7', '0-3', '4-4', '0', '0', '0', '1', '1', '0', '3', '3', '-10', '8']
['Leandro Barbosa', 'Boston Celtics', '16', '6-8', '3-3', '1-2', '0', '1', '1', '1', '0', '0', '0', '1', '+4', '16']
['Chris Wilcox', 'Boston Celtics', "DNP COACH'S DECISION"]
['Kris Joseph', 'Boston Celtics', "DNP COACH'S DECISION"]
['Jason Collins', 'Boston Celtics', "DNP COACH'S DECISION"]
['Darko Milicic', 'Boston Celtics', "DNP COACH'S DECISION"]
['Shane Battier', 'Miami Heat', '29', '2-4', '2-3', '0-0', '0', '2', '2', '1', '1', '0', '0', '3', '+12', '6']
['LeBron James', 'Miami Heat', '29', '10-16', '2-4', '4-5', '1', '9', '10', '3', '2', '0', '0', '2', '+12', '26']
['Chris Bosh', 'Miami Heat', '37', '8-15', '0-1', '3-4', '2', '8', '10', '1', '0', '3', '1', '3', '+15', '19']
['Mario Chalmers', 'Miami Heat', '36', '3-7', '0-1', '2-2', '0', '1', '1', '11', '3', '0', '1', '3', '+11', '8']
['Dwyane Wade', 'Miami Heat', '35', '10-22', '0-0', '9-11', '2', '1', '3', '4', '2', '1', '4', '3', '-6', '29']
['Udonis Haslem', 'Miami Heat', '11', '0-1', '0-0', '0-0', '0', '3', '3', '0', '0', '0', '1', '1', '-2', '0']
['Rashard Lewis', 'Miami Heat', '19', '4-5', '1-2', '1-2', '0', '5', '5', '1', '0', '1', '0', '1', '+1', '10']
['Norris Cole', 'Miami Heat', '6', '1-2', '1-2', '0-0', '0', '0', '0', '1', '0', '0', '1', '2', '+5', '3']
['Ray Allen', 'Miami Heat', '31', '5-7', '2-3', '7-8', '0', '2', '2', '2', '0', '0', '0', '1', '+9', '19']
['Mike Miller', 'Miami Heat', '7', '0-0', '0-0', '0-0', '0', '0', '0', '1', '0', '0', '0', '1', '+8', '0']
['Josh Harrellson', 'Miami Heat', "DNP COACH'S DECISION"]
['James Jones', 'Miami Heat', "DNP COACH'S DECISION"]
['Terrel Harris', 'Miami Heat', "DNP COACH'S DECISION"]
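The lists printed above have 16 entries for players who played but only 3 for DNP players, while the CSV the question asks for needs one value per column. A minimal Python 3 sketch of padding the rows and writing them with the standard library's `csv` module follows. The helper names `pad_row` and `rows_to_csv` are illustrative, and leaving the stat columns blank for DNP players is an assumption that mirrors the NaN cells in the question's `get_players`. The `id` column from the question is omitted here.

```python
import csv
import io

# Column names as printed by the answer's code.
HEADER = ['Name', 'Team', 'MIN', 'FGM-A', '3PM-A', 'FTM-A', 'OREB', 'DREB',
          'REB', 'AST', 'STL', 'BLK', 'TO', 'PF', '+/-', 'PTS']


def pad_row(row, width=len(HEADER)):
    """Return row at full width; DNP rows keep name and team, stats are blank."""
    if len(row) > 2 and row[2].startswith('DNP'):
        row = row[:2]
    return row + [''] * (width - len(row))


def rows_to_csv(rows):
    """Serialize the header plus padded rows to a CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(HEADER)
    for row in rows:
        writer.writerow(pad_row(row))
    return buffer.getvalue()


# Two rows copied from the output above.
ROWS = [
    ['Brandon Bass', 'Boston Celtics', '28', '6-11', '0-0', '3-4', '6', '5',
     '11', '1', '0', '0', '1', '2', '-8', '15'],
    ['Chris Wilcox', 'Boston Celtics', "DNP COACH'S DECISION"],
]

print(rows_to_csv(ROWS))
```

Writing to an `io.StringIO` buffer keeps the sketch free of file-system side effects; passing an open file object to `csv.writer` instead would produce a file like `players_13.csv`.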

To also catch a name such as D.J. Augustine, the simplest (though not the most concise) code is:

try:
    stats = [re.search('[A-Z]?[a-z]?[A-Z][a-z]{1,} [A-Z][a-z]{1,}',player).group()]
except:
    stats = [re.search('[A-Z]\.?[A-Z]?\.? [A-Z][a-z]{1,}',player).group()]
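The bare `except` in the snippet above is triggered by the `AttributeError` raised when `re.search` returns `None` and `.group()` is called on it. A small Python 3 sketch of the same two-pattern fallback that tests the match object directly is shown below. The function name `find_player_name` and the sample fragments are hypothetical; the two patterns are copied from the answer and written as raw strings.

```python
import re

# Patterns from the answer, tried in order.
NAME_PATTERNS = [
    re.compile(r'[A-Z]?[a-z]?[A-Z][a-z]{1,} [A-Z][a-z]{1,}'),  # Kevin Garnett, LeBron James
    re.compile(r'[A-Z]\.?[A-Z]?\.? [A-Z][a-z]{1,}'),           # D.J. Augustine
]


def find_player_name(fragment):
    """Return the first name matched by any pattern, or None if none match."""
    for pattern in NAME_PATTERNS:
        match = pattern.search(fragment)
        if match:
            return match.group()
    return None


# Hypothetical fragments shaped like a player cell.
for cell in ['<a href="#">Kevin Garnett</a>, PF',
             '<a href="#">LeBron James</a>, SF',
             '<a href="#">D.J. Augustine</a>, PG']:
    print(find_player_name(cell))
```

Returning `None` instead of raising lets the caller decide how to handle a fragment with no recognizable name, and adding another name shape only requires appending a pattern to the list.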

Regarding python - Problem parsing NBA boxscore data with BeautifulSoup, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28447487/
