
css - Looping through alphabetical pages (rvest)


After spending a lot of time on this and going through the available answers, I'd like to ask a new question about a web-scraping problem I'm having with R and rvest. I've tried to lay the problem out fully to minimize follow-up questions.

The problem: I am trying to extract author names from a conference webpage. The authors are separated alphabetically by last name; thus, I need a for loop that calls follow_link() 25 times to get to each page and extract the relevant author text.

The conference website: https://gsa.confex.com/gsa/2016AM/webprogram/authora.html

I have attempted two solutions in R using rvest, and both have problems.

Solution 1 (calling links by letter)

library(rvest)

lttrs <- LETTERS[seq(from = 1, to = 26)] # create character vector A-Z
website <- html_session("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")

tempList <- list() #create list to store each page's author information

for(i in 1:length(lttrs)){
  tempList[[i]] <- website %>%
    follow_link(lttrs[i]) %>% # use capital letters to call links to author pages
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}

This code works up to a point; the output is below. It navigates through the lettered pages successfully until the H-to-I and L-to-M transitions, where it grabs the wrong pages.

Navigating to authora.html
Navigating to authorb.html
Navigating to authorc.html
Navigating to authord.html
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authora.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to http://community.geosociety.org/gsa2016/home

Solution 2 (calling links by CSS selector): Using a CSS selector on the page, each lettered page is identified as "a:nth-child(1-26)". So I rebuilt my loop around calls to that CSS identifier.

tempList <- list()
for(i in 2:length(lttrs)){
  tempList[[i]] <- website %>%
    follow_link(css = paste('a:nth-child(', i, ')', sep = '')) %>%
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}

This sort of works, but it again stumbles at certain transitions (see below):

Navigating to authora.html
Navigating to uploadlistall.html
Navigating to http://community.geosociety.org/gsa2016/home
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authori.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to authorm.html
Navigating to authorn.html
Navigating to authoro.html
Navigating to authorp.html
Navigating to authorq.html
Navigating to authorr.html
Navigating to authors.html
Navigating to authort.html
Navigating to authoru.html
Navigating to authorv.html
Navigating to authorw.html
Navigating to authorx.html
Navigating to authory.html
Navigating to authorz.html

Specifically, this approach misses B, C, and D, looping to the wrong pages at those steps. I would appreciate any insight or guidance on how to reconfigure my code above to correctly loop through all 26 lettered pages.

Thanks so much!

Best Answer

Welcome to SO (and kudos 👍🏼 on a first question).

As far as robots.txt goes, you appear to be in luck: the site has a ton of entries in it, but nothing that tries to restrict what you're doing.
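As for why your loops jumped to the wrong pages: when follow_link() is given a string, it follows the first link whose text contains that string (case sensitively), so a bare "I" or "M" can match a navigation link before it matches the letter link; likewise, a:nth-child(n) counts every child element of a link's parent, so the indexes don't line up one-to-one with the letters. A quick diagnostic sketch that lists every link's text and target makes those mismatches visible:

library(rvest)

# list every <a> on the page: follow_link("I") follows the FIRST link
# whose text contains "I", which may be site navigation rather than
# the letter "I" page
pg <- read_html("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")
links <- html_nodes(pg, "a")
data.frame(
  text = html_text(links, trim = TRUE),
  href = html_attr(links, "href"),
  stringsAsFactors = FALSE
)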

Instead of following links at all, we can use html_nodes(pg, "a[href^='author']") to extract all the hrefs from the lettered pagination links at the bottom of the page. The following grabs all the paper links for all the authors:

library(rvest)
library(tidyverse)

pg <- read_html("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")

html_nodes(pg, "a[href^='author']") %>%
  html_attr("href") %>%
  sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>%
  { pb <<- progress_estimated(length(.)); . } %>% # we'll use a progress bar as this will take ~3m
  map_df(~{

    pb$tick()$print() # increment progress bar

    Sys.sleep(5) # PLEASE leave this in. It's rude to hammer a site without a crawl delay

    read_html(.x) %>%
      html_nodes("div.item > div.author") %>%
      map_df(~{
        data_frame(
          author = html_text(.x, trim = TRUE),
          paper = html_nodes(.x, xpath = "../div[@class='papers']/a") %>%
            html_text(trim = TRUE),
          paper_url = html_nodes(.x, xpath = "../div[@class='papers']/a") %>%
            html_attr("href") %>%
            sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .)
        )
      })
  }) -> author_papers

author_papers
## # A tibble: 34,983 x 3
## author paper paper_url
## <chr> <chr> <chr>
## 1 Aadahl, Kristopher 296-5 https://gsa.confex.com/gsa/2016AM/webprogram/Paper283542.html
## 2 Aanderud, Zachary T. 215-11 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286442.html
## 3 Abbey, Alyssa 54-4 https://gsa.confex.com/gsa/2016AM/webprogram/Paper281801.html
## 4 Abbott, Dallas H. 341-34 https://gsa.confex.com/gsa/2016AM/webprogram/Paper287404.html
## 5 Abbott Jr., David M. 38-6 https://gsa.confex.com/gsa/2016AM/webprogram/Paper278060.html
## 6 Abbott, Grant 58-7 https://gsa.confex.com/gsa/2016AM/webprogram/Paper283414.html
## 7 Abbott, Jared 29-10 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286237.html
## 8 Abbott, Jared 317-9 https://gsa.confex.com/gsa/2016AM/webprogram/Paper282386.html
## 9 Abbott, Kathryn A. 187-9 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286127.html
## 10 Abbott, Lon D. 208-16 https://gsa.confex.com/gsa/2016AM/webprogram/Paper280093.html
## # ... with 34,973 more rows

I don't know what you need from the individual paper pages, so that part is left to you.
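As a starting point, here's a minimal sketch for pulling something out of a single paper page. Note that the "h2" selector is an assumption, not something verified against the site, so inspect an actual paper page and adjust:

# hedged sketch: grab the heading text of the first paper page
# NOTE: "h2" is an assumed selector -- verify it against a real paper page
paper_pg <- read_html(author_papers$paper_url[1])
html_nodes(paper_pg, "h2") %>%
  html_text(trim = TRUE)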

You also don't have to wait the ~3 minutes, since the author_papers data frame is in this RDS file: https://rud.is/dl/author-papers.rds which you can read with:

readRDS(url("https://rud.is/dl/author-papers.rds"))
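For example, to pull one author's papers from the cached data frame (filter() comes from dplyr, which tidyverse loads; the name is taken from the preview above):

# example lookup: all rows for one author
author_papers <- readRDS(url("https://rud.is/dl/author-papers.rds"))
filter(author_papers, author == "Abbott, Jared")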

If you do plan on scraping all 34,983 papers, then please continue to heed "don't be rude" and use a crawl delay (ref: https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/).

UPDATE

html_nodes(pg, "a[href^='author']") %>%
  html_attr("href") %>%
  sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>%
  { pb <<- progress_estimated(length(.)); . } %>% # we'll use a progress bar as this will take ~3m
  map_df(~{

    pb$tick()$print() # increment progress bar

    Sys.sleep(5) # PLEASE leave this in. It's rude to hammer a site without a crawl delay

    read_html(.x) %>%
      html_nodes("div.item > div.author") %>%
      map_df(~{
        data_frame(
          author = html_text(.x, trim = TRUE),
          is_presenting = html_nodes(.x, xpath = "../div[@class='papers']") %>%
            html_text(trim = TRUE) %>% # retrieve the text of all the "papers"
            paste0(collapse = " ") %>% # just in case there are multiple nodes we flatten them into one
            grepl("*", ., fixed = TRUE) # make it TRUE if we find the "*"
        )
      })
  }) -> author_with_presenter_status

author_with_presenter_status
## # A tibble: 22,545 x 2
## author is_presenting
## <chr> <lgl>
## 1 Aadahl, Kristopher FALSE
## 2 Aanderud, Zachary T. FALSE
## 3 Abbey, Alyssa TRUE
## 4 Abbott, Dallas H. FALSE
## 5 Abbott Jr., David M. TRUE
## 6 Abbott, Grant FALSE
## 7 Abbott, Jared FALSE
## 8 Abbott, Kathryn A. FALSE
## 9 Abbott, Lon D. FALSE
## 10 Abbott, Mark B. FALSE
## # ... with 22,535 more rows

You can also retrieve this one with:

readRDS(url("https://rud.is/dl/author-presenter.rds"))
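As a quick usage check, you can tally presenter status across all authors (count() is from dplyr):

# how many authors are / aren't flagged as presenting?
author_with_presenter_status <- readRDS(url("https://rud.is/dl/author-presenter.rds"))
count(author_with_presenter_status, is_presenting)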

This question, css - Looping through alphabetical pages (rvest), was originally asked on Stack Overflow: https://stackoverflow.com/questions/53468576/
