rvest: reading a table with cells that span multiple rows

Reposted. Author: 行者123. Updated: 2023-12-02 01:31:07

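I am trying to scrape an irregular table from Wikipedia using rvest. The table has cells that span multiple rows. The documentation for html_table explicitly states that this is a limitation. I am wondering whether there is a workaround.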

The table looks like this: [image]

My code:

library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"
parks <- url %>%
  read_html() %>%
  html_nodes(xpath='/html/body/div[3]/div[3]/div[4]/div/table[2]') %>%
  html_table(fill=TRUE) %>% # fill=FALSE yields the same results
  .[[1]]

This returns:

[image]

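The result contains several errors. For example, row 4 under "City" should be "Mesa", not "Chicago Cubs". I am fine with blank cells, since I can fill them in as needed, but wrong data is a problem. Any help is much appreciated.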
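To make the symptom concrete, here is a minimal base-R sketch of how a short row can end up shifted to the left. The cell texts are hypothetical toy data, and the padding rule is only an illustration of the failure mode, not a description of what html_table does internally.

```r
# Cell texts as they would appear per <tr> in a toy table (hypothetical data).
# The third row is short because "East" spans down from the row above it.
row_cells <- list(
  c("Division", "Team", "City"),
  c("East", "Angels", "Tempe"),
  c("Cubs", "Mesa"),
  c("West", "Royals", "Surprise")
)
n_col <- length(row_cells[[1]])

# Naive layout: left-align every row and pad the right-hand side with NA.
naive <- t(vapply(
  row_cells,
  function(x) c(x, rep(NA_character_, n_col - length(x))),
  character(n_col)
))

# The team name lands under "Division" and the city under "Team".
naive[3, ]
```

A blank cell would be harmless here; the shifted values are what make the result wrong.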

Best answer

I have a way to code it. It is not perfect and it is a bit long, but it works:

library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"

# get the lines (<tr> nodes) of the table
lines <- url %>%
  read_html() %>%
  html_nodes(xpath = "//table[starts-with(@class, 'wikitable')]") %>%
  html_nodes(xpath = 'tbody/tr')

# define the empty table: one column per cell of the header line, one row per line
ncol <- lines %>%
  .[[1]] %>%
  html_children() %>%
  length()
nrow <- length(lines)
table <- as.data.frame(matrix(nrow = nrow, ncol = ncol))

# fill the table
for (i in seq_len(nrow)) {
  # get the cell texts of the line
  linecontent <- lines[[i]] %>%
    html_children() %>%
    html_text() %>%
    gsub("\n", "", .)

  # assign the content to the columns that are still free (NA) on this row
  colselect <- is.na(table[i, ])
  table[i, colselect] <- linecontent

  # get the rowspan of each cell of the line
  repetition <- lines[[i]] %>%
    html_children() %>%
    html_attr("rowspan") %>%
    ifelse(is.na(.), 1, .) %>% # no rowspan attribute means the cell covers a single row
    as.numeric()

  # repeat the cells that span several rows down the table
  for (j in seq_along(repetition)) {
    span <- repetition[j]
    if (span > 1) {
      table[(i + 1):(i + span - 1), colselect][, j] <- rep(linecontent[j], span - 1)
    }
  }
}

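The idea is to store the HTML rows of the table in the lines variable by selecting the tr nodes. Then I create an empty table: the number of columns is the number of children of the first line (since it contains the headers), and the number of rows is the length of lines. I fill it manually in a for loop (I found no better way here).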
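The setup step can be tried without network access. The sketch below uses a small inline table (toy_html is a made-up stand-in for the Wikipedia page, and the names rows, n_col, n_row and empty are illustrative) to show how the dimensions of the empty table come from the tr nodes.

```r
library(rvest)

# Toy table (an assumption for illustration, not the Wikipedia page):
# "East" spans two body rows via rowspan="2".
toy_html <- '<table class="wikitable"><tbody>
  <tr><th>Division</th><th>Team</th><th>City</th></tr>
  <tr><td rowspan="2">East</td><td>Angels</td><td>Tempe</td></tr>
  <tr><td>Cubs</td><td>Mesa</td></tr>
  <tr><td>West</td><td>Royals</td><td>Surprise</td></tr>
</tbody></table>'

# One node per table row
rows <- read_html(toy_html) %>% html_nodes("tr")

# The header row has one cell per column, so it defines the width
n_col <- length(html_children(rows[[1]]))
n_row <- length(rows)

# Empty table: every cell starts as NA and is filled in later
empty <- as.data.frame(matrix(NA_character_, nrow = n_row, ncol = n_col))
dim(empty)
```

Taking the width from the header row only works when the header itself has no merged cells, which holds for this toy table.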

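The difficulty is that the number of cell texts a line provides changes when a cell from an earlier line already spans down into the current row. For example: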

lines[[3]] %>%
  html_children() %>%
  html_text() %>%
  gsub("\n", "", .)

gives only 5 values:

[1] "Arizona League Athletics Gold" "Oakland Athletics"             "Mesa"                          "Fitch Park"                   
[5] "10,000"

instead of 6, because the first column holds "East", which spans 8 rows. The "East" value appears only in the first row it spans.
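The same effect can be reproduced on a small inline table. In this sketch (toy_html and the variable names are assumptions for illustration), the row that follows a rowspan="2" cell has one child fewer than the others, and the rowspan attribute is present only on the cell that starts the span.

```r
library(rvest)

# Toy table (an assumption for illustration, not the Wikipedia page)
toy_html <- '<table class="wikitable"><tbody>
  <tr><th>Division</th><th>Team</th><th>City</th></tr>
  <tr><td rowspan="2">East</td><td>Angels</td><td>Tempe</td></tr>
  <tr><td>Cubs</td><td>Mesa</td></tr>
  <tr><td>West</td><td>Royals</td><td>Surprise</td></tr>
</tbody></table>'

rows <- read_html(toy_html) %>% html_nodes("tr")

# Number of cells that each <tr> actually contains
cells_per_row <- vapply(
  seq_along(rows),
  function(i) length(html_children(rows[[i]])),
  integer(1)
)
cells_per_row

# rowspan is set only on the cell that starts the span; other cells give NA
spans_row2 <- html_attr(html_children(rows[[2]]), "rowspan")
spans_row2
```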

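The trick is to repeat cells downward in the table when they have a rowspan attribute (meaning they span multiple rows). On the next row, only the NA columns are then selected, so the number of texts the HTML line provides matches the number of free columns in the table being filled.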

This is done with the colselect variable, a logical vector that marks the free columns of the current row before that row's cells are repeated downward.
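As a compact illustration of the free-column idea, the sketch below wraps the same steps in a helper and runs it on an inline table. The helper name fill_rowspans and the toy HTML are hypothetical. It uses a character matrix and column indices instead of the data frame and logical mask of the answer, and it assumes the table has no colspan cells.

```r
library(rvest)

# Hypothetical helper: expand rowspan cells so every row has all its columns.
# Assumes no colspan and a header row with one cell per column.
fill_rowspans <- function(rows) {
  n_col <- length(html_children(rows[[1]]))
  out <- matrix(NA_character_, nrow = length(rows), ncol = n_col)
  for (i in seq_along(rows)) {
    cells <- html_children(rows[[i]])
    text <- trimws(html_text(cells))
    span <- as.integer(html_attr(cells, "rowspan"))
    span[is.na(span)] <- 1L              # no attribute: the cell covers one row
    free <- which(is.na(out[i, ]))       # columns not claimed by a span from above
    stopifnot(length(free) == length(text))
    out[i, free] <- text
    for (j in seq_along(text)) {
      last <- min(i + span[j] - 1L, nrow(out))
      if (last > i) out[(i + 1L):last, free[j]] <- text[j]  # copy the cell downward
    }
  }
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Toy table (an assumption for illustration, not the Wikipedia page)
toy_html <- '<table class="wikitable"><tbody>
  <tr><th>Division</th><th>Team</th><th>City</th></tr>
  <tr><td rowspan="2">East</td><td>Angels</td><td>Tempe</td></tr>
  <tr><td>Cubs</td><td>Mesa</td></tr>
  <tr><td>West</td><td>Royals</td><td>Surprise</td></tr>
</tbody></table>'

result <- fill_rowspans(read_html(toy_html) %>% html_nodes("tr"))
result
```

The free columns are computed before the row is written, so free[j] maps the j-th cell of the HTML row to its true column even when earlier columns are already occupied.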

Result:

    V1        V2                              V3                    V4          V5                                  V6
1   Division  Team                            MLB Affiliation       City        Stadium                             Capacity
2   East      Arizona League Angels           Los Angeles Angels    Tempe       Tempe Diablo Stadium                9,785
3   East      Arizona League Athletics Gold   Oakland Athletics     Mesa        Fitch Park                          10,000
4   East      Arizona League Athletics Green  Oakland Athletics     Mesa        Fitch Park                          10,000
5   East      Arizona League Cubs 1           Chicago Cubs          Mesa        Sloan Park                          15,000
6   East      Arizona League Cubs 2           Chicago Cubs          Mesa        Sloan Park                          15,000
7   East      Arizona League Diamondbacks     Arizona Diamondbacks  Scottsdale  Salt River Fields at Talking Stick  11,000
8   East      Arizona League Giants Black     San Francisco Giants  Scottsdale  Scottsdale Stadium                  12,000
9   East      Arizona League Giants Orange    San Francisco Giants  Scottsdale  Scottsdale Stadium                  12,000
10  Central   Arizona League Brewers Gold     Milwaukee Brewers     Phoenix     American Family Fields of Phoenix   8,000
11  Central   Arizona League Dodgers Lasorda  Los Angeles Dodgers   Phoenix     Camelback Ranch                     12,000
12  Central   Arizona League Indians Blue     Cleveland Indians     Goodyear    Goodyear Ballpark                   10,000
13  Central   Arizona League Padres 2         San Diego Padres      Peoria      Peoria Sports Complex               12,882
14  Central   Arizona League Reds             Cincinnati Reds       Goodyear    Goodyear Ballpark                   10,000
15  Central   Arizona League White Sox        Chicago White Sox     Phoenix     Camelback Ranch                     12,000
16  West      Arizona League Brewers Blue     Milwaukee Brewers     Phoenix     American Family Fields of Phoenix   8,000
17  West      Arizona League Dodgers Mota     Los Angeles Dodgers   Phoenix     Camelback Ranch                     12,000
18  West      Arizona League Indians Red      Cleveland Indians     Goodyear    Goodyear Ballpark                   10,000
19  West      Arizona League Mariners         Seattle Mariners      Peoria      Peoria Sports Complex               12,882
20  West      Arizona League Padres 1         San Diego Padres      Peoria      Peoria Sports Complex               12,882
21  West      Arizona League Rangers          Texas Rangers         Surprise    Surprise Stadium                    10,500
22  West      Arizona League Royals           Kansas City Royals    Surprise    Surprise Stadium                    10,500

Edit

I made a shorter version of this function, with more explanation, here.

The original question about reading a table with cells that span multiple rows with rvest is on Stack Overflow: https://stackoverflow.com/questions/57279093/
