
python - Problem scraping Understat chart data with Selenium

Reposted. Author: 行者123. Updated: 2023-12-01 01:11:02

I'm trying to scrape the chart data under the "Timing Chart" tab at https://understat.com/match/9457.

My approach is to use BeautifulSoup and Selenium, but I can't seem to get it to work.

Here is my Python script:

from bs4 import BeautifulSoup
import requests

# Set the url we want
xg_url = 'https://understat.com/match/9457'

# Use requests to download the webpage
xg_data = requests.get(xg_url)

# Get the html code for the webpage
xg_html = xg_data.content

# Parse the html using bs4
soup = BeautifulSoup(xg_html, 'lxml')

#print(soup.prettify())
print(soup.title)

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--no-sandbox")
options.add_argument("--headless")

driver = webdriver.Chrome("/usr/local/bin/chromedriver", chrome_options=options)

# Set up the Selenium driver (in this case I am using the Chrome browser)
options = webdriver.ChromeOptions()

# Tell the driver to navigate to the page url
driver.get(xg_url)

# Grab the html code from the webpage
soup = BeautifulSoup(driver.page_source, 'lxml')

# Get the table headers using 3 chained find operations
# 1. Find the div containing the table (div class = chemp jTable)
# 2. Find the table within that div
# 3. Find all 'th' elements where class = sort
headers = soup.find('div', attrs={'class':'scheme-block'}).find('div').find_all('div',attrs={'class':'chartjs-tooltip team-home is-hide'})

headers

# Iterate over headers, get the text from each item, and add the results to headers_list
headers_list = []
for header in headers:
    headers_list.append(header.get_text(strip=True))
print(headers_list)

# You can also simply call elements like tables directly instead of using find('table') if you are only looking for the first instance of that element
body = soup.find('div', attrs={'class':'scheme-block'}).div

# Create a master list for row data
all_rows_list = []
# For each row in the table body
for tr in body.find_all('tr'):
    # Get data from each cell in the row
    row = tr.find_all('td')
    # Create list to save current row data to
    current_row = []
    # For each item in the row variable
    for item in row:
        # Add the text data to the current_row list
        current_row.append(item.get_text(strip=True))
    # Add the current row data to the master list
    all_rows_list.append(current_row)

# Create a dataframe where the rows = all_rows_list and columns = headers_list
xg_df = pd.DataFrame(all_rows_list, columns=headers_list)
xg_df

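This code was adapted from a different task. I changed a few things to scrape a div instead of a table, but looking at the data, it doesn't seem to have scraped the chart.

Any ideas what might be wrong?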

Best answer

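You're making it a bit more complicated than it needs to be. If you look at the <script> tags, all the data is there. Most of the time it is already in nice JSON format and only needs a bit of string splitting to get at the structure. In this particular case, you'll find it looks a little different: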

<script>
var shotsData = JSON.parse('\x7B\x22h\x22\x3A\x5B\x7B\x22id\x22\x3A\x22271478\x22,\x22minute\x22\x3A\x226\x22,\x22result\x22\x3A\x22MissedShots\x22,\x22....

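But not to worry, it still works with a bit of regex. I also converted the shots data and the rosters data from JSON into dataframes. The match data is a single key holding all the values, so I didn't bother with it since it's just one row. You may not even need the dataframes and can work with the JSON directly, but it's there for you: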

import json
import re

import pandas as pd
import requests

response = requests.get('https://understat.com/match/9457')

# Each dataset is embedded in a <script> tag as JSON.parse('<hex-escaped JSON>')
shotsData = re.search(r"shotsData\s+=\s+JSON\.parse\('([^']+)", response.text)
decoded_string = bytes(shotsData.groups()[0], 'utf-8').decode('unicode_escape')
shotsObj = json.loads(decoded_string)

match_info = re.search(r"match_info\s+=\s+JSON\.parse\('([^']+)", response.text)
decoded_string = bytes(match_info.groups()[0], 'utf-8').decode('unicode_escape')
matchObj = json.loads(decoded_string)

rostersData = re.search(r"rostersData\s+=\s+JSON\.parse\('([^']+)", response.text)
decoded_string = bytes(rostersData.groups()[0], 'utf-8').decode('unicode_escape')
rostersObj = json.loads(decoded_string)


# Shots Data into a DataFrame
# (pd.json_normalize replaces the deprecated pandas.io.json.json_normalize)
away_shots_df = pd.json_normalize(shotsObj['a'])
home_shots_df = pd.json_normalize(shotsObj['h'])
# DataFrame.append was removed in pandas 2.0; pd.concat keeps each frame's own index
shots_df = pd.concat([away_shots_df, home_shots_df])


# Rosters Data into a DataFrame (one row per player)
away_rosters_df = pd.DataFrame(list(rostersObj['a'].values()))
home_rosters_df = pd.DataFrame(list(rostersObj['h'].values()))
rosters_df = pd.concat([away_rosters_df, home_rosters_df])

teams_dict = {'a':matchObj['team_a'], 'h':matchObj['team_h']}
match_title = matchObj['team_h'] + ' vs. ' + matchObj['team_a']

Output:

print (shots_df)
X ... xG
0 0.9069999694824219 ... 0.40696778893470764
1 0.8190000152587891 ... 0.05737118795514107
2 0.94 ... 0.5754774808883667
3 0.9319999694824219 ... 0.02447112277150154
4 0.725 ... 0.02365683950483799
5 0.7759999847412109 ... 0.026968277990818024
6 0.8619999694824219 ... 0.08384699374437332
7 0.7659999847412109 ... 0.013624735176563263
0 0.9269999694824219 ... 0.055443812161684036
1 0.835 ... 0.03609708696603775
2 0.9059999847412109 ... 0.03347432240843773
3 0.9769999694824218 ... 0.07148116827011108
4 0.9869999694824219 ... 0.9712227582931519
5 0.8390000152587891 ... 0.028583310544490814
6 0.8580000305175781 ... 0.07498162239789963
7 0.924000015258789 ... 0.04431726038455963
8 0.9569999694824218 ... 0.48726019263267517
9 0.9540000152587891 ... 0.06847231835126877
10 0.91 ... 0.07779974490404129
11 0.875999984741211 ... 0.04344969615340233
12 0.8780000305175781 ... 0.019344232976436615
13 0.789000015258789 ... 0.043812621384859085
14 0.9419999694824219 ... 0.34188181161880493
15 0.9 ... 0.05839642137289047
16 0.9069999694824219 ... 0.043319668620824814
17 0.8490000152587891 ... 0.058181893080472946
18 0.9019999694824219 ... 0.09132817387580872
19 0.87 ... 0.11395697295665741
20 0.8819999694824219 ... 0.035116128623485565

[29 rows x 20 columns]
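
The regex-and-decode step above is repeated for shotsData, match_info, and rostersData, so it could be factored into one helper. The following is only a sketch: the function name extract_json_var and the sample page fragment are made up for illustration, and it assumes the payload is ASCII-only with \xNN escapes, as in the snippet shown earlier.

```python
import json
import re


def extract_json_var(html, var_name):
    """Return the object passed to JSON.parse('...') for var_name, or None if absent."""
    pattern = r"\b" + re.escape(var_name) + r"\s*=\s*JSON\.parse\('([^']+)'\)"
    match = re.search(pattern, html)
    if match is None:
        return None
    # The payload is written with \xNN escapes; unicode_escape turns them
    # back into the literal JSON characters ({, ", :, [ and so on).
    decoded = match.group(1).encode('utf-8').decode('unicode_escape')
    return json.loads(decoded)


# Hypothetical page fragment in the same style as the real <script> tag
sample_html = r"<script>var shotsData = JSON.parse('\x7B\x22h\x22\x3A\x5B\x7B\x22id\x22\x3A\x22271478\x22,\x22minute\x22\x3A\x226\x22\x7D\x5D\x7D')</script>"

print(extract_json_var(sample_html, 'shotsData'))
```

With a live page, the same helper would be called as extract_json_var(response.text, 'shotsData'). Returning None for a missing variable makes it easier to notice when the page layout changes, instead of failing on .groups().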

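Additional

As suspected, the timing chart is generated from the 'xG' column in shotsData. It is simply a running sum of the xG for each team. I've also included a line chart at the end that you can hover over. Check out plotly. I've used it before and it's great, but it's beyond the scope of the question. Here's a quick one I put together: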

Timing Chart

#########################################################################
# Timing Chart is an aggregation (running sum) of xG from the shotsData
#########################################################################
import numpy as np

# Convert 'minute' astype int and sort the dataframe by 'minute'
shots_df['minute'] = shots_df['minute'].astype(int)
shots_df['xG'] = shots_df['xG'].astype(float)

timing_chart_df = shots_df[['h_a', 'minute', 'xG']].sort_values('minute')
timing_chart_df['h_a'] = timing_chart_df['h_a'].map(teams_dict)

# Get max value of the 'minute' column to build a full minute range up to it
max_value = timing_chart_df['minute'].max()

# Aggregate xG within the same minute
timing_chart_df = timing_chart_df.groupby(['h_a','minute'], as_index=False)['xG'].sum()

# Build a (team, minute) index covering every minute from 0 to max_value
min_idx = np.arange(max_value + 1)
m_idx = pd.MultiIndex.from_product([timing_chart_df['h_a'].unique(), min_idx], names=['h_a', 'minute'])


# Fill minutes without shots with 0 xG, then calculate the running sum per team
timing_chart_df = timing_chart_df.set_index(['h_a', 'minute']).reindex(m_idx, fill_value=0).reset_index()
timing_chart_df['running_sum_xG'] = timing_chart_df.groupby('h_a')['xG'].cumsum()


timing_chart_T_df = timing_chart_df.pivot(index='h_a', columns='minute', values='running_sum_xG')
timing_chart_T_df = timing_chart_T_df.reset_index().rename(columns={timing_chart_T_df.index.name:match_title})

Output:

print (timing_chart_T_df.to_string())
minute West Ham vs. Fulham 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
0 Fulham 0.406968 0.464339 1.039816 1.039816 1.039816 1.039816 1.039816 1.064288 1.064288 1.064288 1.064288 1.064288 1.064288 1.064288 1.064288 1.064288 1.064288 1.064288 1.064288 1.064288 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.087944 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.114913 1.198760 1.198760 1.198760 1.19876 1.19876 1.198760 1.198760 1.198760 1.198760 1.212384
1 West Ham 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.055444 0.055444 0.055444 0.055444 0.055444 0.055444 0.055444 0.055444 0.055444 0.055444 0.055444 0.055444 0.055444 0.055444 0.055444 0.055444 0.091541 0.091541 0.091541 0.091541 0.091541 0.091541 1.167719 1.167719 1.196302 1.196302 1.196302 1.196302 1.271284 1.271284 1.315601 1.315601 1.315601 1.802862 1.802862 1.871334 1.949134 1.949134 1.992583 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.011928 2.055740 2.055740 2.055740 2.397622 2.397622 2.397622 2.397622 2.397622 2.397622 2.397622 2.456018 2.499338 2.55752 2.55752 2.648848 2.762805 2.797921 2.797921 2.797921
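
To make the running-sum step easier to follow in isolation, here is a small sketch on made-up data. The team names, minutes, and xG values are invented for illustration and are not taken from the match above. It uses the same idea as the code above: sum xG per team and minute, fill the minutes without shots with 0, then take a cumulative sum per team.

```python
import numpy as np
import pandas as pd

# Made-up shots: two teams, a few minutes, illustrative xG values
shots = pd.DataFrame({
    'team': ['Home', 'Away', 'Home', 'Home'],
    'minute': [1, 2, 2, 4],
    'xG': [0.10, 0.30, 0.05, 0.20],
})

# Total xG per team and minute (several shots can share a minute)
per_minute = shots.groupby(['team', 'minute'])['xG'].sum()

# Every (team, minute) pair from minute 0 up to the last shot
full_index = pd.MultiIndex.from_product(
    [sorted(shots['team'].unique()), np.arange(shots['minute'].max() + 1)],
    names=['team', 'minute'],
)

# Minutes without a shot contribute 0, so the cumulative sum stays flat there
running = (
    per_minute.reindex(full_index, fill_value=0.0)
    .groupby(level='team')
    .cumsum()
    .rename('running_sum_xG')
    .reset_index()
)
print(running)
```

Filling the gaps before the cumulative sum is what gives the chart its step shape: each team's line only rises in minutes where a shot was taken.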

Plotting the line chart:

import plotly.graph_objects as go

# Note: plotly.plotly (online plotting with credentials) moved to the separate
# chart_studio package in plotly 4, so the figure is rendered locally here.

# 'h_a' was mapped to team names earlier, so filter on the names rather than 'a'/'h'
away_df = timing_chart_df[timing_chart_df['h_a'] == teams_dict['a']]
home_df = timing_chart_df[timing_chart_df['h_a'] == teams_dict['h']]

# Create traces
trace0 = go.Scatter(
    x=away_df['minute'],
    y=away_df['running_sum_xG'],
    mode='lines',
    name=teams_dict['a'],  # 'Fulham' for this match
    line=dict(color='#E5E64B', width=4),
)
trace1 = go.Scatter(
    x=home_df['minute'],
    y=home_df['running_sum_xG'],
    mode='lines',
    name=teams_dict['h'],  # 'West Ham' for this match
    line=dict(color='#00BCD4', width=4),
)

data_comp = [trace0, trace1]

layout_comp = go.Layout(
    autosize=False,
    width=800,
    height=600,
    title='Timing Chart',
    plot_bgcolor='#3E3E40',
    hovermode='x',
    xaxis=dict(
        title='Minute',
        ticklen=15,
        zeroline=True,
        showgrid=True,
        gridcolor='#39393B',
        gridwidth=2,
    ),
    yaxis=dict(
        title='xG',
        ticklen=5,
        gridwidth=2,
        zeroline=True,
        showgrid=True,
        gridcolor='#39393B',
    ),
)

fig_comp = go.Figure(data=data_comp, layout=layout_comp)
fig_comp.show()

Regarding "python - Problem scraping Understat chart data with Selenium", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54868228/
