python - Trouble merging scraped data using Pandas and numpy in Python


Reposted by: 行者123 · Updated: 2023-11-30 22:31:48

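I am trying to collect information from many different URLs and combine the data by year and golfer name. At the moment I write the information to CSV files and then match them with pd.merge(), but that requires a unique name for every dataframe I want to merge. I tried using a numpy array instead, but I am stuck on the final step of getting all the separate pieces of data merged together.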

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import socket
import urllib.error
import pandas as pd
import urllib
import sqlalchemy
import numpy as np

base = 'http://www.pgatour.com/'
inn = 'stats/stat'
end = '.html'
years = ['2017','2016','2015','2014','2013']

alpha = []
#all pages with links to tables
urls = ['http://www.pgatour.com/stats.html','http://www.pgatour.com/stats/categories.ROTT_INQ.html','http://www.pgatour.com/stats/categories.RAPP_INQ.html','http://www.pgatour.com/stats/categories.RARG_INQ.html','http://www.pgatour.com/stats/categories.RPUT_INQ.html','http://www.pgatour.com/stats/categories.RSCR_INQ.html','http://www.pgatour.com/stats/categories.RSTR_INQ.html','http://www.pgatour.com/stats/categories.RMNY_INQ.html','http://www.pgatour.com/stats/categories.RPTS_INQ.html']
for i in urls:
    data = urlopen(i)
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            alpha.append(base + link['href'][17:]) #may need adjusting
#data links
beta = []
for i in alpha:
    if inn in i:
        beta.append(i)
#no repeats
gamma= []
for i in beta:
    if i not in gamma:
        gamma.append(i)

#making list of urls with Statistic labels
jan = []
for i in gamma:
    try:
        data = urlopen(i)
        soup = BeautifulSoup(data, "html.parser")
        for table in soup.find_all('section',{'class':'module-statistics-off-the-tee-details'}):
            for j in table.find_all('h3'):
                y=j.get_text().replace(" ","").replace("-","").replace(":","").replace(">","").replace("<","").replace(">","").replace(")","").replace("(","").replace("=","").replace("+","")
                jan.append([i,str(y+'.csv')])
                print([i,str(y+'.csv')])
    except Exception as e:
            print(e)
            pass

# practice url
#jan = [['http://www.pgatour.com/stats/stat.02356.html', 'Last15EventsScoring.csv']]
#grabbing data
#write to csv
row_sp = []
rows_sp =[]
title1 = [] 
title = []  
for i in jan:
    try:
        with open(i[1], 'w+') as fp:
            writer = csv.writer(fp)
            for y in years:
                data = urlopen(i[0][:-4] +y+ end)
                soup = BeautifulSoup(data, "html.parser")
                data1 = urlopen(i[0])
                soup1 = BeautifulSoup(data1, "html.parser")
                for table in soup1.find_all('table',{'id':'statsTable'}):
                    title.append('year')
                    for k in table.find_all('tr'):
                        for n in k.find_all('th'):
                            title1.append(n.get_text())
                            for l in title1:
                                if l not in title:
                                    title.append(l)
                    rows_sp.append(title)
                for table in soup.find_all('table',{'id':'statsTable'}):
                    for h in table.find_all('tr'):
                        row_sp = [y]
                        for j in h.find_all('td'):
                            row_sp.append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                        rows_sp.append(row_sp)
                        print(row_sp)
                        writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass

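from functools import reduce  # reduce is not a builtin in Python 3

# df1, df2, df3 and the key 'v1' are placeholders for the per-statistic dataframes
dfs = [df1, df2, df3]  # store dataframes in one list
df_merge = reduce(lambda left, right: pd.merge(left, right, on=['v1'], how='outer'), dfs)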

URLs, stat types, desired format
"..." just stands for everything in between
I am trying to get the data onto one row
URLs for the data below: ['http://www.pgatour.com/stats/stat.02356.html', 'http://www.pgatour.com/stats/stat.02568.html', ..., 'http://www.pgatour.com/stats/stat.111.html']

Stat titles

LAST 15 EVENTS - SCORING, SG: APPROACH-THE-GREEN, ..., SAND SAVE PERCENTAGE
year rankthisweek  ranklastweek   name         events   rating    rounds avg
2017 2             3             Rickie Fowler  10      8.8       62    .614    
TOTAL SG:APP   MEASURED ROUNDS   .... %     # SAVES    # BUNKERS    TOTAL O/U PAR
26.386         43                ....70.37    76           108          +7.00

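Accepted answer

UPDATE (per comments)
This question is partly about a technical method (Pandas merge()), but it also seems like an opportunity to discuss a useful workflow for data collection and cleaning. I have therefore added more detail and explanation than a coding solution strictly requires.

You can use essentially the same approach as in my original answer to get data from the different URL categories. I suggest keeping a dict of {url: data} as you iterate over your list of URLs, and then building cleaned data frames from that dict.

Setting up the cleaning portion involves some tedious work, because you need to adjust for the different columns in each URL category. I have demonstrated a manual approach, using only a few test URLs. If you have, say, thousands of different URL categories, you may need to think about how to collect and organize the column names programmatically. That feels out of scope for this question.

As long as you are sure that every URL has a year field and a PLAYER NAME field, the following merge should work. As before, let's assume you don't need to write to CSV, and for now let's leave the scraping code unoptimized:

First, define the url categories in urls. By url category I mean that http://www.pgatour.com/stats/stat.02356.html is actually used several times, by inserting a series of years into the url itself, e.g. http://www.pgatour.com/stats/stat.02356.2017.html, http://www.pgatour.com/stats/stat.02356.2016.html. In this example, stat.02356.html is the url category that holds multiple years of player data.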

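import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

# carried over from the question's code; the scraping loop below relies on them
years = ['2017','2016','2015','2014','2013']
end = '.html'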

# test urls given by OP
# note: each url contains >= 1 data fields not shared by the others
urls = ['http://www.pgatour.com/stats/stat.02356.html',
        'http://www.pgatour.com/stats/stat.02568.html',
        'http://www.pgatour.com/stats/stat.111.html']

# we'll store data from each url category in this dict.
url_data = {}
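
To make the notion of a url category concrete, here is a minimal sketch of how one category expands into its per-year URLs. The helper name `yearly_url` is hypothetical (it does not appear in the original code); it does the same thing as the `url[:-4] + y + end` expression used in the scraping loop, just with an explicit check on the suffix.

```python
def yearly_url(category_url: str, year: str) -> str:
    """Insert a year before the trailing '.html' of a url category."""
    suffix = ".html"
    if not category_url.endswith(suffix):
        raise ValueError(f"expected a url ending in {suffix!r}: {category_url}")
    # 'stat.02356.html' -> 'stat.02356' + '.2017' + '.html'
    return f"{category_url[:-len(suffix)]}.{year}{suffix}"


years = ["2017", "2016", "2015", "2014", "2013"]
category = "http://www.pgatour.com/stats/stat.02356.html"
yearly = [yearly_url(category, y) for y in years]
print(yearly[0])
```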

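Now iterate over urls. Inside the urls loop, the code is the same as in my original answer, which in turn comes from the OP. Only a few variable names have been adjusted to reflect the new capture method.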

for url in urls:
    print("url: ", url)
    url_data[url] = {"row_sp": [],
                     "rows_sp": [],
                     "title1": [],
                     "title": []}
    try:
        #with open(i[1], 'w+') as fp:
            #writer = csv.writer(fp)
        for y in years:
            current_url = url[:-4] +y+ end
            print("current url is: ", current_url)
            data = urlopen(current_url)
            soup = BeautifulSoup(data, "html.parser")
            data1 = urlopen(url)
            soup1 = BeautifulSoup(data1, "html.parser")
            for table in soup1.find_all('table',{'id':'statsTable'}):
                url_data[url]["title"].append('year')
                for k in table.find_all('tr'):
                    for n in k.find_all('th'):
                        url_data[url]["title1"].append(n.get_text())
                        for l in url_data[url]["title1"]:
                            if l not in url_data[url]["title"]:
                                url_data[url]["title"].append(l)
                url_data[url]["rows_sp"].append(url_data[url]["title"])
            for table in soup.find_all('table',{'id':'statsTable'}):
                for h in table.find_all('tr'):
                    url_data[url]["row_sp"] = [y]
                    for j in h.find_all('td'):
                        url_data[url]["row_sp"].append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                    url_data[url]["rows_sp"].append(url_data[url]["row_sp"])
                    #print(row_sp)
                    #writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass
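
One detail worth knowing about the cell-cleaning line: the chained `.replace("d","")` removes every lowercase "d" from the text, which appears to be why names such as "Aam Scott" show up in the merged output further down. The source does not say what that replace was meant to do. If the goal is only to tidy whitespace, a sketch like the following may be enough. The helper name `clean_cell` is an assumption for illustration, not part of the original code.

```python
def clean_cell(text: str) -> str:
    """Collapse runs of whitespace, including non-breaking spaces, to single spaces."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # so '\xa0' and '\n' are handled without separate replace calls.
    return " ".join(text.split())


print(clean_cell("\n   Adam\xa0Scott \n"))
print(clean_cell(" 8.8 "))
```

In the loop, this would stand in for the chain of `.replace(...)` calls applied to `j.get_text()`.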

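Now, for each key url in url_data, rows_sp contains the data you are interested in for that particular url category.
Note that rows_sp is now actually url_data[url]["rows_sp"] when we iterate over url_data, but the next few code blocks come from my original answer and so use the old rows_sp variable name.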

# example rows_sp
[['year',
  'RANK THIS WEEK',
  'RANK LAST WEEK',
  'PLAYER NAME',
  'EVENTS',
  'RATING',
  'year',
  'year',
  'year',
  'year'],
 ['2017'],
 ['2017', '1', '1', 'Sam Burns', '1', '9.2'],
 ['2017', '2', '3', 'Rickie Fowler', '10', '8.8'],
 ['2017', '2', '2', 'Dustin Johnson', '10', '8.8'],
 ['2017', '2', '3', 'Whee Kim', '2', '8.8'],
 ['2017', '2', '3', 'Thomas Pieters', '3', '8.8'],
 ...
]

Writing rows_sp directly to a data frame shows that the data is not quite in the right format:

pd.DataFrame(rows_sp).head()
      0               1               2               3       4       5     6  \
0  year  RANK THIS WEEK  RANK LAST WEEK     PLAYER NAME  EVENTS  RATING  year   
1  2017            None            None            None    None    None  None   
2  2017               1               1       Sam Burns       1     9.2  None   
3  2017               2               3   Rickie Fowler      10     8.8  None   
4  2017               2               2  Dustin Johnson      10     8.8  None   

      7     8     9  
0  year  year  year  
1  None  None  None  
2  None  None  None  
3  None  None  None  
4  None  None  None  

pd.DataFrame(rows_sp).dtypes
0    object
1    object
2    object
3    object
4    object
5    object
6    object
7    object
8    object
9    object
dtype: object

With a bit of clean-up, we can get rows_sp into a data frame with appropriate numeric data types:

df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
df.columns = ["year","RANK THIS WEEK","RANK LAST WEEK",
              "PLAYER NAME","EVENTS","RATING",
              "year1","year2","year3","year4"]
df.drop(["year1","year2","year3","year4"], axis=1, inplace=True)  # axis must be passed by keyword in pandas 2.x
df = df.loc[df["PLAYER NAME"].notnull()]
df = df.loc[df.year != "year"]
num_cols = ["RANK THIS WEEK","RANK LAST WEEK","EVENTS","RATING"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)

df.head()
   year  RANK THIS WEEK  RANK LAST WEEK     PLAYER NAME  EVENTS  RATING
2  2017               1             1.0       Sam Burns       1     9.2
3  2017               2             3.0   Rickie Fowler      10     8.8
4  2017               2             2.0  Dustin Johnson      10     8.8
5  2017               2             3.0        Whee Kim       2     8.8
6  2017               2             3.0  Thomas Pieters       3     8.8
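
As an alternative sketch of the same clean-up, the malformed rows can be filtered out before the data frame is built, so that no placeholder year1..year4 columns are needed. The sample rows below are copied from the example rows_sp above. The filtering rule (keep only rows whose length matches the de-duplicated header) is an assumption about what counts as a valid data row.

```python
import pandas as pd

rows_sp = [
    ["year", "RANK THIS WEEK", "RANK LAST WEEK", "PLAYER NAME",
     "EVENTS", "RATING", "year", "year", "year", "year"],
    ["2017"],
    ["2017", "1", "1", "Sam Burns", "1", "9.2"],
    ["2017", "2", "3", "Rickie Fowler", "10", "8.8"],
]

# Drop the repeated 'year' entries from the header while keeping column order.
header = list(dict.fromkeys(rows_sp[0]))

# Keep only complete data rows; this skips header rows and the stray ['2017'] rows.
data = [r for r in rows_sp if len(r) == len(header) and r[0] != "year"]

df = pd.DataFrame(data, columns=header)
num_cols = ["RANK THIS WEEK", "RANK LAST WEEK", "EVENTS", "RATING"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)
print(df.dtypes)
```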

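Updated cleaning
Now that we have a series of url categories to deal with, each with a different set of fields to clean, the section above becomes a little more complicated. If you only have a few pages, it may be feasible to review the fields for each category by eye and store them, like this: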

cols = {'stat.02568.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 
                                      'PLAYER NAME', 'ROUNDS', 'AVERAGE', 
                                      'TOTAL SG:APP', 'MEASURED ROUNDS', 
                                      'year1', 'year2', 'year3', 'year4'],
                           'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS', 
                                      'AVERAGE', 'TOTAL SG:APP', 'MEASURED ROUNDS',]
                          },
        'stat.111.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 
                                    'PLAYER NAME', 'ROUNDS', '%', '# SAVES', '# BUNKERS', 
                                    'TOTAL O/U PAR', 'year1', 'year2', 'year3', 'year4'],
                         'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
                                   '%', '# SAVES', '# BUNKERS', 'TOTAL O/U PAR']
                        },
        'stat.02356.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
                                      'PLAYER NAME', 'EVENTS', 'RATING', 
                                      'year1', 'year2', 'year3', 'year4'],
                           'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 
                                      'EVENTS', 'RATING']
                          }
       }
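
If there are too many categories to list by hand, entries of this shape could be derived from the scraped title lists instead. The following is a rough sketch only: the function name `column_spec` is hypothetical, and it assumes that every column other than year and PLAYER NAME is numeric, which would need checking against the real pages.

```python
def column_spec(title):
    """Build a {'columns': ..., 'numeric': ...} entry from a scraped title list."""
    columns = []
    extra_years = 0
    for name in title:
        if name == "year" and "year" in columns:
            # Repeated 'year' headers become year1, year2, ... as in the manual dict.
            extra_years += 1
            columns.append(f"year{extra_years}")
        else:
            columns.append(name)
    # Assumption: everything except the key fields and year placeholders is numeric.
    numeric = [c for c in columns
               if c != "PLAYER NAME" and not c.startswith("year")]
    return {"columns": columns, "numeric": numeric}


title = ["year", "RANK THIS WEEK", "RANK LAST WEEK", "PLAYER NAME",
         "EVENTS", "RATING", "year", "year", "year", "year"]
spec = column_spec(title)
print(spec["columns"])
print(spec["numeric"])
```

With the url_data structure from the scraping loop, this could be called as `column_spec(url_data[url]["title"])` for each url.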

Then you can loop over url_data again and store the results in a dfs collection:

dfs = {}

for url in url_data:
    page = url.split("/")[-1]
    colnames = cols[page]["columns"]
    num_cols = cols[page]["numeric"]
    rows_sp = url_data[url]["rows_sp"]
    df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
    df.columns = colnames
    df.drop(["year1","year2","year3","year4"], axis=1, inplace=True)  # axis must be passed by keyword in pandas 2.x
    df = df.loc[df["PLAYER NAME"].notnull()]
    df = df.loc[df.year != "year"]
    # tied ranks (e.g. "T9") mess up to_numeric; remove the tie indicators.
    df["RANK THIS WEEK"] = df["RANK THIS WEEK"].str.replace("T","")
    df["RANK LAST WEEK"] = df["RANK LAST WEEK"].str.replace("T","")
    df[num_cols] = df[num_cols].apply(pd.to_numeric)
    dfs[url] = df
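
The tie-marker handling can be tried in isolation. This small sketch uses made-up rank values, strips only a leading "T", and passes `errors="coerce"` so that anything still unparseable becomes NaN instead of raising. Whether silently coercing bad values is acceptable depends on the data, so treat that choice as an assumption.

```python
import pandas as pd

# Made-up sample values: plain ranks, tied ranks, and an empty cell.
ranks = pd.Series(["1", "T9", "T12", "", "45"])

# Remove a leading tie marker only, leaving other characters untouched.
cleaned = ranks.str.replace(r"^T", "", regex=True)

# Unparseable entries (here the empty string) become NaN rather than raising.
numeric = pd.to_numeric(cleaned, errors="coerce")
print(numeric.tolist())
```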

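At this point we are ready to merge all the different data categories on year and PLAYER NAME. (You could actually merge iteratively inside the cleaning loop, but I keep it separate here for illustration.)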

master = pd.DataFrame()
for url in dfs:
    if master.empty:
        master = dfs[url]
    else:
        master = master.merge(dfs[url], on=['year','PLAYER NAME'])
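
A variation on this merge, sketched with made-up data: tagging each non-key column with its stat page before merging avoids the ambiguous _x/_y suffixes pandas adds when column names collide, and `functools.reduce` folds the list of frames into one. The helper name `tag_columns`, the bracketed label format, and keying the dict by page name rather than full url are assumptions for illustration. Note also that `how="outer"` keeps player-years missing from some categories, whereas the loop above uses the default inner join.

```python
from functools import reduce

import pandas as pd

keys = ["year", "PLAYER NAME"]

# Made-up miniature frames standing in for the cleaned per-category data.
dfs = {
    "stat.02356.html": pd.DataFrame({
        "year": ["2017", "2016"], "PLAYER NAME": ["A", "A"],
        "ROUNDS": [10, 12], "RATING": [8.8, 8.1]}),
    "stat.111.html": pd.DataFrame({
        "year": ["2017", "2016"], "PLAYER NAME": ["A", "B"],
        "ROUNDS": [10, 9], "%": [61.9, 50.0]}),
}


def tag_columns(df, page):
    """Append the stat page id to every non-key column name."""
    stat = page.replace(".html", "")
    return df.rename(columns={c: f"{c} [{stat}]" for c in df.columns if c not in keys})


tagged = [tag_columns(df, page) for page, df in dfs.items()]
master = reduce(lambda left, right: left.merge(right, on=keys, how="outer"), tagged)
print(master.columns.tolist())
```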

Now master contains the merged data for each player-year. Here is a view of the data, using groupby():

master.groupby(["PLAYER NAME", "year"]).first().head(4)
                  RANK THIS WEEK_x  RANK LAST WEEK_x  EVENTS  RATING  \
PLAYER NAME year                                                       
Aam Hawin   2015                66              66.0       7     8.2   
            2016                80              80.0      12     8.1   
            2017                72              45.0       8     8.2   
Aam Scott   2013                45              45.0      10     8.2   

                  RANK THIS WEEK_y  RANK LAST WEEK_y  ROUNDS_x  AVERAGE  \
PLAYER NAME year                                                          
Aam Hawin   2015               136               136        95   -0.183   
            2016               122               122        93   -0.061   
            2017                56                52        84    0.296   
Aam Scott   2013                16                16        61    0.548   

                  TOTAL SG:APP  MEASURED ROUNDS  RANK THIS WEEK  \
PLAYER NAME year                                                  
Aam Hawin   2015       -14.805               81              86   
            2016        -5.285               87              39   
            2017        18.067               61               8   
Aam Scott   2013        24.125               44              57   

                  RANK LAST WEEK  ROUNDS_y      %  # SAVES  # BUNKERS  \
PLAYER NAME year                                                        
Aam Hawin   2015              86        95  50.96       80        157   
            2016              39        93  54.78       86        157   
            2017               6        84  61.90       91        147   
Aam Scott   2013              57        61  53.85       49         91   

                  TOTAL O/U PAR  
PLAYER NAME year                 
Aam Hawin   2015           47.0  
            2016           43.0  
            2017           27.0  
Aam Scott   2013           11.0

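You may want to do some more cleaning on the merged columns, since some are duplicated across data categories (e.g. ROUNDS_x and ROUNDS_y). As far as I can tell, the duplicated field names seem to contain exactly the same information, so you could simply drop the _y version of each.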
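
A cautious way to do that, sketched on a made-up one-row frame, is to drop a _y column only when it really is identical to its _x counterpart and then strip the suffix from the survivor. In the output above, for instance, the two RANK THIS WEEK columns hold different values, so a blanket drop would lose information. The equality check is an added safeguard, not something the original answer specifies.

```python
import pandas as pd

# Made-up miniature of the merged frame.
master = pd.DataFrame({
    "year": ["2017"], "PLAYER NAME": ["A"],
    "ROUNDS_x": [84], "AVERAGE": [0.296],
    "ROUNDS_y": [84], "%": [61.9],
})

# Find _y columns whose _x twin exists and holds exactly the same values.
same = [c for c in master.columns
        if c.endswith("_y")
        and c[:-2] + "_x" in master.columns
        and master[c].equals(master[c[:-2] + "_x"])]

# Drop the redundant copies, then rename 'ROUNDS_x' back to 'ROUNDS'.
master = master.drop(columns=same)
master = master.rename(columns={c[:-2] + "_x": c[:-2] for c in same})
print(list(master.columns))
```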

This post is based on a similar question found on Stack Overflow, "python - Trouble merging scraped data using Pandas and numpy in Python": https://stackoverflow.com/questions/45697765/
