
python - Trouble merging scraped data with Pandas and numpy in Python


I'm trying to collect information from a number of different URLs and combine the data by year and golfer name. Right now I'm writing the information to CSV and then matching it up with pd.merge(), but that forces me to use a unique name for every dataframe I want to merge. I tried using a numpy array, but I'm stuck on the final step of getting all of the separate pieces of data merged together.

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import socket
import urllib.error
import pandas as pd
import urllib
import sqlalchemy
import numpy as np

base = 'http://www.pgatour.com/'
inn = 'stats/stat'
end = '.html'
years = ['2017','2016','2015','2014','2013']

alpha = []
#all pages with links to tables
urls = ['http://www.pgatour.com/stats.html',
        'http://www.pgatour.com/stats/categories.ROTT_INQ.html',
        'http://www.pgatour.com/stats/categories.RAPP_INQ.html',
        'http://www.pgatour.com/stats/categories.RARG_INQ.html',
        'http://www.pgatour.com/stats/categories.RPUT_INQ.html',
        'http://www.pgatour.com/stats/categories.RSCR_INQ.html',
        'http://www.pgatour.com/stats/categories.RSTR_INQ.html',
        'http://www.pgatour.com/stats/categories.RMNY_INQ.html',
        'http://www.pgatour.com/stats/categories.RPTS_INQ.html']
for i in urls:
    data = urlopen(i)
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            alpha.append(base + link['href'][17:]) #may need adjusting

#data links
beta = []
for i in alpha:
    if inn in i:
        beta.append(i)

#no repeats
gamma = []
for i in beta:
    if i not in gamma:
        gamma.append(i)

#making list of urls with Statistic labels
jan = []
for i in gamma:
    try:
        data = urlopen(i)
        soup = BeautifulSoup(data, "html.parser")
        for table in soup.find_all('section',{'class':'module-statistics-off-the-tee-details'}):
            for j in table.find_all('h3'):
                y = j.get_text().replace(" ","").replace("-","").replace(":","").replace(">","").replace("<","").replace(")","").replace("(","").replace("=","").replace("+","")
                jan.append([i, str(y+'.csv')])
                print([i, str(y+'.csv')])
    except Exception as e:
        print(e)
        pass

# practice url
#jan = [['http://www.pgatour.com/stats/stat.02356.html', 'Last15EventsScoring.csv']]
#grabbing data
#write to csv
row_sp = []
rows_sp =[]
title1 = []
title = []
for i in jan:
    try:
        with open(i[1], 'w+') as fp:
            writer = csv.writer(fp)
            for y in years:
                data = urlopen(i[0][:-4] + y + end)
                soup = BeautifulSoup(data, "html.parser")
                data1 = urlopen(i[0])
                soup1 = BeautifulSoup(data1, "html.parser")
                for table in soup1.find_all('table',{'id':'statsTable'}):
                    title.append('year')
                    for k in table.find_all('tr'):
                        for n in k.find_all('th'):
                            title1.append(n.get_text())
                    for l in title1:
                        if l not in title:
                            title.append(l)
                    rows_sp.append(title)
                for table in soup.find_all('table',{'id':'statsTable'}):
                    for h in table.find_all('tr'):
                        row_sp = [y]
                        for j in h.find_all('td'):
                            row_sp.append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                        rows_sp.append(row_sp)
                        print(row_sp)
                        writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass

from functools import reduce  # needed for the reduce-based merge below

dfs = [df1, df2, df3] # store dataframes in one list
df_merge = reduce(lambda left, right: pd.merge(left, right, on=['v1'], how='outer'), dfs)
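
For clarity, here is a minimal, self-contained sketch of the kind of reduce-based merge the snippet above is aiming for, keyed on the shared year and PLAYER NAME columns. The dataframes and values are hypothetical stand-ins for the scraped tables.

from functools import reduce
import pandas as pd

# hypothetical stand-ins for two scraped statistic tables
scoring = pd.DataFrame({"year": ["2017", "2017"],
                        "PLAYER NAME": ["Rickie Fowler", "Dustin Johnson"],
                        "RATING": [8.8, 8.8]})
approach = pd.DataFrame({"year": ["2017", "2017"],
                         "PLAYER NAME": ["Rickie Fowler", "Dustin Johnson"],
                         "TOTAL SG:APP": [26.386, 30.100]})

frames = [scoring, approach]
merged = reduce(lambda left, right: pd.merge(left, right,
                                             on=["year", "PLAYER NAME"],
                                             how="outer"),
                frames)
print(merged)  # one row per player/year, with columns from both tables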

URL, statistic type, desired format
...and everything in between - I'm trying to get each player's data onto a single row.

URLs for the data below: ['http://www.pgatour.com/stats/stat.02356.html', 'http://www.pgatour.com/stats/stat.02568.html', ..., 'http://www.pgatour.com/stats/stat.111.html']

Statistic titles:
LAST 15 EVENTS - SCORING, SG: APPROACH-THE-GREEN, ..., SAND SAVE PERCENTAGE

Desired format (one row per player, wrapped here for width):
year rankthisweek ranklastweek name events rating rounds avg
2017 2 3 Rickie Fowler 10 8.8 62 .614
TOTAL SG:APP MEASURED ROUNDS .... % # SAVES # BUNKERS TOTAL O/U PAR
26.386 43 ....70.37 76 108 +7.00

Best Answer

UPDATE (per the comments)
This question is partly about the technical method (Pandas merge()), but it also seems like a good opportunity to walk through a useful workflow for data collection and cleaning, so I've added more detail and explanation than a coding solution strictly requires.

You can basically use the same approach as in my original answer to gather data from the different URL categories. I'd suggest keeping a dict of {url: data} as you iterate over the URL list, and then building cleaned dataframes from that dict.

Setting up the cleaning involves some tedious work, because you have to adjust for the different columns in each URL category. I've demonstrated a manual approach below using only a few test URLs, but if you have thousands of different URL categories you may want to think about how to collect and organize the column names programmatically (a rough sketch of that appears after the manual column dict below). That feels beyond the scope of this question.

As long as you're sure that year and PLAYER NAME fields exist in every URL, the merge below should work. As before, let's assume you don't need to write to CSV and hold off on any optimization of the scraping code for now:

First, define the URL categories in urls. By a URL category I mean that http://www.pgatour.com/stats/stat.02356.html is actually used multiple times by splicing a series of years into the URL itself, e.g. http://www.pgatour.com/stats/stat.02356.2017.html and http://www.pgatour.com/stats/stat.02356.2016.html. In this example, stat.02356.html is the URL category that holds the player data across multiple years.
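
As a quick illustration (not part of the answer's code), this is how one URL category expands into the per-year URLs, using the same url[:-4] + y + end slicing that appears in the scraping loops:

end = '.html'
years = ['2017', '2016', '2015', '2014', '2013']
url = 'http://www.pgatour.com/stats/stat.02356.html'

# url[:-4] strips the trailing 'html' (keeping the dot), so appending a year
# and '.html' yields e.g. 'http://www.pgatour.com/stats/stat.02356.2017.html'
per_year_urls = [url[:-4] + y + end for y in years]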

import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

# reuse the year list and url suffix from the OP's setup
years = ['2017','2016','2015','2014','2013']
end = '.html'

# test urls given by OP
# note: each url contains >= 1 data fields not shared by the others
urls = ['http://www.pgatour.com/stats/stat.02356.html',
'http://www.pgatour.com/stats/stat.02568.html',
'http://www.pgatour.com/stats/stat.111.html']

# we'll store data from each url category in this dict.
url_data = {}

Now iterate over urls. Inside the urls loop, this code is exactly the same as in my original answer, which in turn was adapted from the OP - only a few variable names have been adjusted to reflect the new way we capture the data.
for url in urls:
    print("url: ", url)
    url_data[url] = {"row_sp": [],
                     "rows_sp": [],
                     "title1": [],
                     "title": []}
    try:
        #with open(i[1], 'w+') as fp:
        #writer = csv.writer(fp)
        for y in years:
            current_url = url[:-4] + y + end
            print("current url is: ", current_url)
            data = urlopen(current_url)
            soup = BeautifulSoup(data, "html.parser")
            data1 = urlopen(url)
            soup1 = BeautifulSoup(data1, "html.parser")
            for table in soup1.find_all('table',{'id':'statsTable'}):
                url_data[url]["title"].append('year')
                for k in table.find_all('tr'):
                    for n in k.find_all('th'):
                        url_data[url]["title1"].append(n.get_text())
                for l in url_data[url]["title1"]:
                    if l not in url_data[url]["title"]:
                        url_data[url]["title"].append(l)
                url_data[url]["rows_sp"].append(url_data[url]["title"])
            for table in soup.find_all('table',{'id':'statsTable'}):
                for h in table.find_all('tr'):
                    url_data[url]["row_sp"] = [y]
                    for j in h.find_all('td'):
                        url_data[url]["row_sp"].append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                    url_data[url]["rows_sp"].append(url_data[url]["row_sp"])
                    #print(row_sp)
                    #writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass

Now, for every key url in url_data, rows_sp holds the data you're interested in for that particular URL category.
Note that rows_sp is now actually url_data[url]["rows_sp"] when we iterate over url_data, but the next few code blocks come from my original answer and therefore use the old rows_sp variable name.
# example rows_sp
[['year',
  'RANK THIS WEEK',
  'RANK LAST WEEK',
  'PLAYER NAME',
  'EVENTS',
  'RATING',
  'year',
  'year',
  'year',
  'year'],
 ['2017'],
 ['2017', '1', '1', 'Sam Burns', '1', '9.2'],
 ['2017', '2', '3', 'Rickie Fowler', '10', '8.8'],
 ['2017', '2', '2', 'Dustin Johnson', '10', '8.8'],
 ['2017', '2', '3', 'Whee Kim', '2', '8.8'],
 ['2017', '2', '3', 'Thomas Pieters', '3', '8.8'],
 ...
]

Writing rows_sp directly to a dataframe shows that the data isn't formatted quite right:
pd.DataFrame(rows_sp).head()
0 1 2 3 4 5 6 \
0 year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING year
1 2017 None None None None None None
2 2017 1 1 Sam Burns 1 9.2 None
3 2017 2 3 Rickie Fowler 10 8.8 None
4 2017 2 2 Dustin Johnson 10 8.8 None

7 8 9
0 year year year
1 None None None
2 None None None
3 None None None
4 None None None

pd.DataFrame(rows_sp).dtypes
0 object
1 object
2 object
3 object
4 object
5 object
6 object
7 object
8 object
9 object
dtype: object

With a little cleanup, we can get rows_sp into a dataframe with proper numeric data types:
df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
df.columns = ["year","RANK THIS WEEK","RANK LAST WEEK",
"PLAYER NAME","EVENTS","RATING",
"year1","year2","year3","year4"]
df.drop(["year1","year2","year3","year4"], 1, inplace=True)
df = df.loc[df["PLAYER NAME"].notnull()]
df = df.loc[df.year != "year"]
num_cols = ["RANK THIS WEEK","RANK LAST WEEK","EVENTS","RATING"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)

df.head()
year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING
2 2017 1 1.0 Sam Burns 1 9.2
3 2017 2 3.0 Rickie Fowler 10 8.8
4 2017 2 2.0 Dustin Johnson 10 8.8
5 2017 2 3.0 Whee Kim 2 8.8
6 2017 2 3.0 Thomas Pieters 3 8.8

UPDATED CLEANING
Now that we have a series of URL categories to handle, each with a different set of fields to clean, the section above gets a bit more involved. If you only have a few pages, it may be workable to just eyeball each category's fields and store them by hand, like this:
cols = {'stat.02568.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
                                      'PLAYER NAME', 'ROUNDS', 'AVERAGE',
                                      'TOTAL SG:APP', 'MEASURED ROUNDS',
                                      'year1', 'year2', 'year3', 'year4'],
                           'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
                                      'AVERAGE', 'TOTAL SG:APP', 'MEASURED ROUNDS']
                          },
        'stat.111.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
                                    'PLAYER NAME', 'ROUNDS', '%', '# SAVES', '# BUNKERS',
                                    'TOTAL O/U PAR', 'year1', 'year2', 'year3', 'year4'],
                         'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
                                    '%', '# SAVES', '# BUNKERS', 'TOTAL O/U PAR']
                        },
        'stat.02356.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
                                      'PLAYER NAME', 'EVENTS', 'RATING',
                                      'year1', 'year2', 'year3', 'year4'],
                           'numeric':['RANK THIS WEEK', 'RANK LAST WEEK',
                                      'EVENTS', 'RATING']
                          }
       }
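
As an aside, here is a rough sketch of the programmatic alternative mentioned earlier, assuming url_data[url]["title"] (built in the scraping loop above) reliably holds each category's headers in order; it derives the column lists instead of typing them by hand:

auto_cols = {}
for url, data in url_data.items():
    page = url.split("/")[-1]
    columns, extra_years = [], 0
    for name in data["title"]:
        # the repeated trailing 'year' headers become year1, year2, ... so
        # they can be dropped later, mirroring the manual dict above
        if name == "year" and "year" in columns:
            extra_years += 1
            columns.append("year" + str(extra_years))
        else:
            columns.append(name)
    non_numeric = {"PLAYER NAME"} | {c for c in columns if c.startswith("year")}
    auto_cols[page] = {"columns": columns,
                       "numeric": [c for c in columns if c not in non_numeric]}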

Then you can loop over url_data again and store the results in a dfs collection:
dfs = {}

for url in url_data:
    page = url.split("/")[-1]
    colnames = cols[page]["columns"]
    num_cols = cols[page]["numeric"]
    rows_sp = url_data[url]["rows_sp"]
    df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
    df.columns = colnames
    df.drop(["year1","year2","year3","year4"], axis=1, inplace=True)
    df = df.loc[df["PLAYER NAME"].notnull()]
    df = df.loc[df.year != "year"]
    # tied ranks (e.g. "T9") mess up to_numeric; remove the tie indicators.
    df["RANK THIS WEEK"] = df["RANK THIS WEEK"].str.replace("T","")
    df["RANK LAST WEEK"] = df["RANK LAST WEEK"].str.replace("T","")
    df[num_cols] = df[num_cols].apply(pd.to_numeric)
    dfs[url] = df

At this point we're ready to merge all the different data categories on year and PLAYER NAME. (You could actually do the merges iteratively inside the cleaning loop, but I'm separating them here for illustration.)
master = pd.DataFrame()
for url in dfs:
    if master.empty:
        master = dfs[url]
    else:
        master = master.merge(dfs[url], on=['year','PLAYER NAME'])

Now master contains the merged data for each player-year. Here's a view of the data using groupby():
master.groupby(["PLAYER NAME", "year"]).first().head(4)
RANK THIS WEEK_x RANK LAST WEEK_x EVENTS RATING \
PLAYER NAME year
Aam Hawin 2015 66 66.0 7 8.2
2016 80 80.0 12 8.1
2017 72 45.0 8 8.2
Aam Scott 2013 45 45.0 10 8.2

RANK THIS WEEK_y RANK LAST WEEK_y ROUNDS_x AVERAGE \
PLAYER NAME year
Aam Hawin 2015 136 136 95 -0.183
2016 122 122 93 -0.061
2017 56 52 84 0.296
Aam Scott 2013 16 16 61 0.548

TOTAL SG:APP MEASURED ROUNDS RANK THIS WEEK \
PLAYER NAME year
Aam Hawin 2015 -14.805 81 86
2016 -5.285 87 39
2017 18.067 61 8
Aam Scott 2013 24.125 44 57

RANK LAST WEEK ROUNDS_y % # SAVES # BUNKERS \
PLAYER NAME year
Aam Hawin 2015 86 95 50.96 80 157
2016 39 93 54.78 86 157
2017 6 84 61.90 91 147
Aam Scott 2013 57 61 53.85 49 91

TOTAL O/U PAR
PLAYER NAME year
Aam Hawin 2015 47.0
2016 43.0
2017 27.0
Aam Scott 2013 11.0

You may want to do some additional cleanup on the merged columns, since some are duplicated across the data categories (e.g. ROUNDS_x and ROUNDS_y). From what I can tell, the duplicated field names appear to contain exactly the same information, so it should be safe to just drop the _y version of each.
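
A small sketch of that final cleanup, assuming you have verified that each _x/_y pair really is identical: drop the _y duplicates and strip the _x suffix from what remains.

# drop duplicated columns produced by the merge, then tidy the names
dup_cols = [c for c in master.columns if c.endswith("_y")]
master = master.drop(dup_cols, axis=1)
master.columns = [c[:-2] if c.endswith("_x") else c for c in master.columns]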

This question, python - Trouble merging scraped data with Pandas and numpy in Python, comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/45697765/
