
r - Scraping a Wikipedia table

Reposted. Author: 行者123. Updated: 2023-12-04 09:38:17

I scraped a Wikipedia table using R:

library(rvest)

url <- "https://en.wikipedia.org/wiki/New_York_City"
nyc <- url %>%
  read_html() %>%
  html_node(xpath = '//*[@id="mw-content-text"]/div/table[1]') %>%
  html_table(fill = TRUE)

and would like to save the values into a new data frame.

Output:

Area           Population
468.484 sq mi  8,336,817
What is the best way to do this?

Accepted answer

Judging by the OP's example output, they want a table at a different xpath from the one given in the question. See the workflow below. Note: the column names are set manually, which saves the trouble of cleaning up strings taken from the header row.

# Initialise the package in the session: rvest => .GlobalEnv
library(rvest)

# Store the url as a scalar: url => character vector
url <- "https://en.wikipedia.org/wiki/New_York_City"

# Scrape the table and store it in memory: nyc => data.frame
nyc <-
  url %>%
  read_html() %>%
  html_node(xpath = '/html/body/div[3]/div[3]/div[4]/div/table[3]') %>%
  html_table(fill = TRUE) %>%
  data.frame()

# Set the names appropriately: names(nyc) => character vector
names(nyc) <- c("borough", "county", "pop_est_2019",
                "gdp_bill_usd", "gdp_per_cap",
                "land_area_sq_mi", "land_area_sq_km",
                "density_pop_sq_mi", "density_pop_sq_km")

# Drop the header and footer rows, then coerce each vector to the
# appropriate type: cleaned => data.frame
# Note: `4:nrow(nyc)-1` parses as `(4:nrow(nyc)) - 1` in R; the intended
# range is rows 4 through the second-to-last, i.e. `4:(nrow(nyc) - 1)`.
cleaned <- data.frame(lapply(nyc[4:(nrow(nyc) - 1), ], function(x) {
  if (length(grep("\\d+\\,\\d+$|^\\d+\\.\\d+$", x)) > 0) {
    # Numeric-looking column: strip thousands separators and whitespace
    as.numeric(trimws(gsub("\\,", "", as.character(x)), "both"))
  } else {
    # Otherwise treat the column as categorical
    as.factor(x)
  }
}))
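The coercion step above hinges on a regular expression that decides whether a column looks numeric. A minimal sketch of that pattern applied to individual values (the vector `x` is hypothetical, not taken from the scraped table; the answer itself applies `grep()` once per column rather than per value):

```r
# Hypothetical sample values mimicking the scraped columns
x <- c("8,336,817", "302.6", "Bronx")

# Values ending in "digits,digits", or consisting entirely of
# "digits.digits", are treated as numeric
looks_numeric <- grepl("\\d+\\,\\d+$|^\\d+\\.\\d+$", x)
# → TRUE TRUE FALSE

# Strip the thousands separators before coercing, as in the answer
as.numeric(gsub(",", "", x[looks_numeric]))
# → 8336817 302.6
```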

Regarding "r - Scraping a Wikipedia table", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62443786/
