
r - Get (rvest) multiple HTML pages from a list of urls

Reposted. Author: 行者123. Updated: 2023-12-04 09:24:54

I have a data frame that looks like this:

country <- c("US", "Canada", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States",
         "http://en.wikipedia.org/wiki/Canada",
         "http://en.wikipedia.org/wiki/Japan",
         "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)

  country                                        url
1      US http://en.wikipedia.org/wiki/United_States
2  Canada        http://en.wikipedia.org/wiki/Canada
3   Japan         http://en.wikipedia.org/wiki/Japan
4   China         http://en.wikipedia.org/wiki/China

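Using rvest, I want to scrape the table of contents (TOC) of each url and bind the results into a single output.

This code extracts the table of contents for one url: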
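library(rvest)

# Parse a single page (here the first url) and pull the text of its TOC entries.
# read_html() replaces the html() function from older rvest releases.
toc <- read_html(url[1]) %>%
  html_nodes(".toctext") %>%
  html_text()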
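
As a minimal sketch, the extraction step can be wrapped in a small helper so it can be reused for every page. The helper name `extract_toc` and the inline HTML fragment are assumptions made for illustration. The `.toctext` selector is the one used in the question, and whether it matches a live page depends on that page's current markup.

```r
library(rvest)

# Hypothetical helper (not from the original post): return the text of every
# node matching `selector` in an already-parsed document.
extract_toc <- function(doc, selector = ".toctext") {
  doc %>%
    html_nodes(selector) %>%
    html_text()
}

# Illustrative inline HTML standing in for a downloaded page.
page <- read_html('<ul>
  <li><span class="toctext">Etymology</span></li>
  <li><span class="toctext">History</span></li>
</ul>')

toc <- extract_toc(page)
print(toc)
```

Keeping the parsing separate from the download makes the selector easy to check on a small fragment before running it against real pages.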

Desired output:

country  toc
US       Etymology
         History
         Native American and European contact
         Settlements
         ...
Canada   Etymology
         History
         Aboriginal peoples
         European colonization
         ...etc

Best answer

This scrapes them into one full data frame (one row per TOC entry). The tedious but straightforward "print/output" code is left to the OP:

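library(rvest)
library(dplyr)

country <- c("US", "Canada", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States",
         "http://en.wikipedia.org/wiki/Canada",
         "http://en.wikipedia.org/wiki/Japan",
         "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url, stringsAsFactors = FALSE)

# Fetch each page in turn; x is the url for the current iteration.
# read_html() replaces the html() function from older rvest releases.
toc_entries <- bind_rows(lapply(url, function(x) {
  toc <- read_html(x) %>%
    html_nodes(".toctext") %>%
    html_text()
  # rep() keeps the data frame valid even when a page has no matching nodes.
  data.frame(url = rep(x, length(toc)), toc_entry = toc,
             stringsAsFactors = FALSE)
}))

# Attach the country to every TOC entry via the shared url column.
df <- toc_entries %>% left_join(df, by = "url")

# Show up to 10 random rows.
df[sample(nrow(df), min(10, nrow(df))), ]

## Sample of the output as recorded in the original post. That run fetched the
## first url for every page and paired Canada and US with each other's urls,
## so the rows below will not match a run of the code above.
##
## Source: local data frame [10 x 3]
##
##    url                                         toc_entry                             country
## 1  http://en.wikipedia.org/wiki/Japan          Government finance                    Japan
## 2  http://en.wikipedia.org/wiki/Canada         Cold War and civil rights era         US
## 3  http://en.wikipedia.org/wiki/United_States  Food                                  Canada
## 4  http://en.wikipedia.org/wiki/Japan          Sports                                Japan
## 5  http://en.wikipedia.org/wiki/Canada         Religion                              US
## 6  http://en.wikipedia.org/wiki/China          Cold War and civil rights era         China
## 7  http://en.wikipedia.org/wiki/Japan          Literature, philosophy, and the arts  Japan
## 8  http://en.wikipedia.org/wiki/United_States  Population                            Canada
## 9  http://en.wikipedia.org/wiki/Japan          Settlements                           Japan
## 10 http://en.wikipedia.org/wiki/Canada         Military                              US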
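
The "print/output" step left to the OP could look roughly like the sketch below, which shows the country only on the first row of each group, as in the desired output. The names `toc_df` and `printable` are hypothetical, and the small data frame is a stand-in built from entries listed in the question rather than a real scrape. It assumes the rows for each country are contiguous and that no country appears in two separate runs.

```r
library(dplyr)

# Stand-in for the scraped result, using entries from the question's desired output.
toc_df <- data.frame(
  country   = c("US", "US", "Canada", "Canada"),
  toc_entry = c("Etymology", "History", "Etymology", "History"),
  stringsAsFactors = FALSE
)

# Blank out the country on every row where it has already been seen,
# then keep only the two columns of the desired layout.
printable <- toc_df %>%
  mutate(country = ifelse(duplicated(country), "", country)) %>%
  select(country, toc = toc_entry)

print(printable, row.names = FALSE)
```

`duplicated()` flags every repeat of a value after its first occurrence, which is why the contiguity assumption matters: a country that reappeared later in the data frame would be left blank there too.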

Regarding "r - Get (rvest) multiple HTML pages from a list of urls", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28906601/
