
html - Scraping data from tables on multiple web pages in R (football players)


I'm working on a project for school where I need to collect the career statistics of individual NCAA football players. The data for each player is in this format:

http://www.sports-reference.com/cfb/players/ryan-aplin-1.html

I cannot find an aggregate of all the players, so I need to go page by page and pull out the bottom row of each passing, rushing, receiving, etc. HTML table.

The players are categorized by their last names, and there are links to each letter of the alphabet here:

http://www.sports-reference.com/cfb/players/

For example, every player whose last name starts with A can be found here:

http://www.sports-reference.com/cfb/players/a-index.html

This is my first real exposure to data scraping, so I tried to find similar questions and answers. The closest answer I found is this question.

I believe I could use a very similar approach, swapping the page numbers for the collected player names. However, I'm not sure how to change it to look up player names instead of page numbers.

Samuel L. Ventura also recently gave a talk on data scraping for NFL data, which can be found here.

Edit:

Ben has been really helpful and provided some great code. The first part works perfectly, but when I try to run the second part I run into this problem:

> # unlist into a single character vector
> links <- unlist(links)
> # Go to each URL in the list and scrape all the data from the tables
> # this will take some time... don't interrupt it!
> all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") :
no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
> # Put player names in the list so we know who the data belong to
> # extract names from the URLs to their stats page...
> toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
> player_names <- unique (gsub(paste(toMatch,collapse="|"), "", links))
Error: cannot allocate vector of size 512 Kb
> # assign player names to list of tables
> names(all_tables) <- player_names
Error: object 'player_names' not found
> fix(inx_page)
Error in edit(name, file, title, editor) :
unexpected '<' occurred on line 1
use a command like
x <- edit()
to recover
In addition: Warning message:
In edit.default(name, file, title, editor = defaultEditor) :
deparse may be incomplete

This might be an error caused by not having enough memory (the computer I'm currently using only has 4 GB). I don't understand this error, though:

    > all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") :
no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
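
One way to work within that memory limit (a sketch, not part of Ben's answer; the batch size and file names are illustrative) is to scrape the URLs in batches, saving each batch to disk and freeing memory before the next:

# a sketch (not from the original answer): scrape in batches so a 4 GB
# machine isn't holding every table in memory at once; assumes the
# `links` vector from the answer below has already been built
batch_size <- 1000   # illustrative; tune to your memory
batches <- split(links, ceiling(seq_along(links) / batch_size))
for(b in seq_along(batches)){
  tables_b <- lapply(batches[[b]], function(u)
    tryCatch(readHTMLTable(u, stringsAsFactors = FALSE),
             error = function(e) NULL))   # skip pages that fail to parse
  saveRDS(tables_b, file = paste0("tables_batch_", b, ".rds"))
  rm(tables_b); gc()   # release memory before the next batch
}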

Looking at my other datasets, my players actually only go back to 2007. If there were some way to pull only people from 2007 onward, that might help shrink the data. If I had a list of the people whose names I want to pull, could I just replace lnk in

 links[[i]] <- paste0("http://www.sports-reference.com", lnk)

with only the players I need?
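
For that narrower goal, a small sketch (the `wanted` vector here is hypothetical) that filters the completed `links` vector from the answer below rather than changing lnk inside the loop:

# a sketch: keep only URLs whose player slug matches a list of names;
# `wanted` is a hypothetical vector of the players you actually need
wanted <- c("ryan-aplin", "neli-aasa")
links <- links[grepl(paste(wanted, collapse = "|"), links)]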

Best Answer

Here's how you can easily get all the data from all the tables on all the player pages...

First, make a list of the URLs of all the player pages...

require(RCurl); require(XML)
n <- length(letters)
# pre-allocate list to fill
links <- vector("list", length = n)
for(i in 1:n){
  print(i) # keep track of what the function is up to
  # get all html on each page of the a-z index pages
  inx_page <- htmlParse(getURI(paste0("http://www.sports-reference.com/cfb/players/", letters[i], "-index.html")))
  # scrape URLs for each player from each index page
  lnk <- unname(xpathSApply(inx_page, "//a/@href"))
  # skip first 63 and last 10 links as they are constant on each page
  lnk <- lnk[-c(1:63, (length(lnk)-10):length(lnk))]
  # only keep links that go to players (exclude schools)
  lnk <- lnk[grep("players", lnk)]
  # now we have a list of all the URLs to all the players on that index page
  # but the URLs are incomplete, so let's complete them so we can use them
  # from anywhere
  links[[i]] <- paste0("http://www.sports-reference.com", lnk)
}
# unlist into a single character vector
links <- unlist(links)
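
A quick sanity check (not in the original answer) before committing to the long scraping step:

# confirm how many player URLs were collected and what they look like
length(links)   # expect a number in the tens of thousands
head(links, 3)  # should be complete http://... player-page URLs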

Now we have a vector of roughly 67,000 URLs (that seems like a lot of players, doesn't it?), so:

Second, go to each URL and scrape all the tables there to get their data, like this:

# Go to each URL in the list and scrape all the data from the tables
# this will take some time... don't interrupt it!
# start edit1 here - just so you can see what's changed
# pre-allocate list
all_tables <- vector("list", length = length(links))
for(i in seq_along(links)){
  print(i)
  # error handling - skips to next URL if it gets an error
  result <- try(
    all_tables[[i]] <- readHTMLTable(links[i], stringsAsFactors = FALSE)
  )
  if(inherits(result, "try-error")) next
}
# end edit1 here
# Put player names in the list so we know who the data belong to
# extract names from the URLs to their stats page...
toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
player_names <- unique(gsub(paste(toMatch, collapse = "|"), "", links))
# assign player names to list of tables
names(all_tables) <- player_names
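
One optional refinement, offered as a sketch rather than part of the answer: pausing briefly between requests is gentler on the server and can reduce failed fetches (which surface as the NULL/xmlNamespaceDefinitions error above); the half-second delay here is an arbitrary choice:

# a sketch of the same loop with a short pause between requests
for(i in seq_along(links)){
  result <- try(
    all_tables[[i]] <- readHTMLTable(links[i], stringsAsFactors = FALSE)
  )
  if(inherits(result, "try-error")) next
  Sys.sleep(0.5)  # be polite to the server between page requests
}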

The result looks like this (just a snippet of the output):

all_tables
$`neli-aasa`
$`neli-aasa`$defense
Year School Conf Class Pos Solo Ast Tot Loss Sk Int Yds Avg TD PD FR Yds TD FF
1 *2007 Utah MWC FR DL 2 1 3 0.0 0.0 0 0 0 0 0 0 0 0
2 *2010 Utah MWC SR DL 4 4 8 2.5 1.5 0 0 0 1 0 0 0 0

$`neli-aasa`$kick_ret
Year School Conf Class Pos Ret Yds Avg TD Ret Yds Avg TD
1 *2007 Utah MWC FR DL 0 0 0 0 0 0
2 *2010 Utah MWC SR DL 2 24 12.0 0 0 0 0

$`neli-aasa`$receiving
Year School Conf Class Pos Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
1 *2007 Utah MWC FR DL 1 41 41.0 0 0 0 0 1 41 41.0 0
2 *2010 Utah MWC SR DL 0 0 0 0 0 0 0 0 0

Finally, let's say we just want to look at the passing tables...

# just show passing tables
passing <- lapply(all_tables, function(i) i$passing)
# but lots of NULL in here, and not a convenient format, so...
passing <- do.call(rbind, passing)

We end up with a data frame that's ready for further analysis (again, just a snippet)...

             Year             School Conf Class Pos Cmp Att  Pct  Yds Y/A AY/A TD Int  Rate
james-aaron  1978          Air Force  Ind        QB  28  56 50.0  316 5.6  3.6  1   3  92.6
jeff-aaron.1 2000 Alabama-Birmingham CUSA    JR  QB 100 182 54.9 1135 6.2  6.0  5   3 113.1
jeff-aaron.2 2001 Alabama-Birmingham CUSA    SR  QB  77 148 52.0  828 5.6  4.3  4   6  99.8
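
Since the question specifically asked for the bottom row of each table, here is a short sketch built on the objects above; the same pattern applies to the rushing and receiving tables:

# a sketch: keep only the bottom row of each player's passing table,
# taken from the list before the rbind step, then combine into one data frame
passing_list <- lapply(all_tables, function(i) i$passing)
last_rows <- lapply(passing_list, function(df)
  if(is.null(df)) NULL else df[nrow(df), , drop = FALSE])
last_rows <- do.call(rbind, last_rows)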

Regarding html - Scraping data from tables on multiple web pages in R (football players), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20319321/
