
css - Looping through alphabetical pages (rvest)


After spending a lot of time on this and going through the available answers, I'd like to ask a new question about a web-scraping problem I'm having with R and rvest. I've tried to lay the problem out fully to minimize follow-up questions.

The problem: I am trying to extract author names from a conference webpage. The authors are separated alphabetically by last name; thus, I need a for loop that calls follow_link() 25 times to get to each page and extract the relevant author text.

The conference website: https://gsa.confex.com/gsa/2016AM/webprogram/authora.html

I have attempted two solutions in R using rvest, and both have problems.

Solution 1 (calling links by letter)

library(rvest)

lttrs <- LETTERS[seq(from = 1, to = 26)] # create character vector A-Z
website <- html_session("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")

tempList <- list() #create list to store each page's author information

for(i in 1:length(lttrs)){
  tempList[[i]] <- website %>%
    follow_link(lttrs[i]) %>% # use capital letters to call links to author pages
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}

This code works up to a point; the output is below. It navigates through the lettered pages successfully until the H-to-I and L-to-M transitions, where it grabs the wrong pages.

Navigating to authora.html
Navigating to authorb.html
Navigating to authorc.html
Navigating to authord.html
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authora.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to http://community.geosociety.org/gsa2016/home

Solution 2 (calling links by CSS selector): Using a CSS selector on the page, each lettered page is identified as "a:nth-child(1-26)". So I rebuilt my loop around calls to that CSS identifier.

tempList <- list()
for(i in 2:length(lttrs)){
  tempList[[i]] <- website %>%
    follow_link(css = paste('a:nth-child(', i, ')', sep = '')) %>%
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}

This sort of works, but it again stumbles at certain transitions (see below):

Navigating to authora.html
Navigating to uploadlistall.html
Navigating to http://community.geosociety.org/gsa2016/home
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authori.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to authorm.html
Navigating to authorn.html
Navigating to authoro.html
Navigating to authorp.html
Navigating to authorq.html
Navigating to authorr.html
Navigating to authors.html
Navigating to authort.html
Navigating to authoru.html
Navigating to authorv.html
Navigating to authorw.html
Navigating to authorx.html
Navigating to authory.html
Navigating to authorz.html

Specifically, this approach misses B, C, and D, looping to the wrong pages at those steps. I would appreciate any insight or guidance on how to reconfigure my code above to correctly loop through all 26 lettered pages.

Thanks so much!

Best Answer

Welcome to SO (and kudos 👍🏼 on a first question).

As far as robots.txt goes, you appear to be in luck: the site has a ton of entries in it, but nothing that tries to restrict what you're doing.
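As for why your loops jumped to the wrong pages: when follow_link() is given a string, it follows the first link whose text contains that string (case sensitively), so a bare "I" or "M" can match a navigation link before it matches the letter link; likewise, a:nth-child(n) counts every child element of a link's parent, so the indexes don't line up one-to-one with the letters. A quick diagnostic sketch that lists every link's text and target makes those mismatches visible:

library(rvest)

# list every <a> on the page: follow_link("I") follows the FIRST link
# whose text contains "I", which may be site navigation rather than
# the letter "I" page
pg <- read_html("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")
links <- html_nodes(pg, "a")
data.frame(
  text = html_text(links, trim = TRUE),
  href = html_attr(links, "href"),
  stringsAsFactors = FALSE
)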

Instead of following links at all, we can use html_nodes(pg, "a[href^='author']") to extract all the hrefs from the lettered pagination links at the bottom of the page. The following grabs all the paper links for all the authors:

library(rvest)
library(tidyverse)

pg <- read_html("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")

html_nodes(pg, "a[href^='author']") %>%
  html_attr("href") %>%
  sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>%
  { pb <<- progress_estimated(length(.)); . } %>% # we'll use a progress bar as this will take ~3m
  map_df(~{

    pb$tick()$print() # increment progress bar

    Sys.sleep(5) # PLEASE leave this in. It's rude to hammer a site without a crawl delay

    read_html(.x) %>%
      html_nodes("div.item > div.author") %>%
      map_df(~{
        data_frame(
          author = html_text(.x, trim = TRUE),
          paper = html_nodes(.x, xpath = "../div[@class='papers']/a") %>%
            html_text(trim = TRUE),
          paper_url = html_nodes(.x, xpath = "../div[@class='papers']/a") %>%
            html_attr("href") %>%
            sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .)
        )
      })
  }) -> author_papers

author_papers
## # A tibble: 34,983 x 3
## author paper paper_url
## <chr> <chr> <chr>
## 1 Aadahl, Kristopher 296-5 https://gsa.confex.com/gsa/2016AM/webprogram/Paper283542.html
## 2 Aanderud, Zachary T. 215-11 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286442.html
## 3 Abbey, Alyssa 54-4 https://gsa.confex.com/gsa/2016AM/webprogram/Paper281801.html
## 4 Abbott, Dallas H. 341-34 https://gsa.confex.com/gsa/2016AM/webprogram/Paper287404.html
## 5 Abbott Jr., David M. 38-6 https://gsa.confex.com/gsa/2016AM/webprogram/Paper278060.html
## 6 Abbott, Grant 58-7 https://gsa.confex.com/gsa/2016AM/webprogram/Paper283414.html
## 7 Abbott, Jared 29-10 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286237.html
## 8 Abbott, Jared 317-9 https://gsa.confex.com/gsa/2016AM/webprogram/Paper282386.html
## 9 Abbott, Kathryn A. 187-9 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286127.html
## 10 Abbott, Lon D. 208-16 https://gsa.confex.com/gsa/2016AM/webprogram/Paper280093.html
## # ... with 34,973 more rows

I don't know what you need from the individual paper pages, so that part is left to you.
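As a starting point, here's a minimal sketch for pulling something out of a single paper page. Note that the "h2" selector is an assumption, not something verified against the site, so inspect an actual paper page and adjust:

# hedged sketch: grab the heading text of the first paper page
# NOTE: "h2" is an assumed selector -- verify it against a real paper page
paper_pg <- read_html(author_papers$paper_url[1])
html_nodes(paper_pg, "h2") %>%
  html_text(trim = TRUE)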

You also don't have to wait the ~3 minutes, since the author_papers data frame is in this RDS file: https://rud.is/dl/author-papers.rds which you can read with:

readRDS(url("https://rud.is/dl/author-papers.rds"))
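For example, to pull one author's papers from the cached data frame (filter() comes from dplyr, which tidyverse loads; the name is taken from the preview above):

# example lookup: all rows for one author
author_papers <- readRDS(url("https://rud.is/dl/author-papers.rds"))
filter(author_papers, author == "Abbott, Jared")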

If you do plan on scraping all 34,983 papers, then please continue to heed "don't be rude" and use a crawl delay (ref: https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/).

UPDATE

html_nodes(pg, "a[href^='author']") %>%
  html_attr("href") %>%
  sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>%
  { pb <<- progress_estimated(length(.)); . } %>% # we'll use a progress bar as this will take ~3m
  map_df(~{

    pb$tick()$print() # increment progress bar

    Sys.sleep(5) # PLEASE leave this in. It's rude to hammer a site without a crawl delay

    read_html(.x) %>%
      html_nodes("div.item > div.author") %>%
      map_df(~{
        data_frame(
          author = html_text(.x, trim = TRUE),
          is_presenting = html_nodes(.x, xpath = "../div[@class='papers']") %>%
            html_text(trim = TRUE) %>% # retrieve the text of all the "papers"
            paste0(collapse = " ") %>% # just in case there are multiple nodes we flatten them into one
            grepl("*", ., fixed = TRUE) # make it TRUE if we find the "*"
        )
      })
  }) -> author_with_presenter_status

author_with_presenter_status
## # A tibble: 22,545 x 2
## author is_presenting
## <chr> <lgl>
## 1 Aadahl, Kristopher FALSE
## 2 Aanderud, Zachary T. FALSE
## 3 Abbey, Alyssa TRUE
## 4 Abbott, Dallas H. FALSE
## 5 Abbott Jr., David M. TRUE
## 6 Abbott, Grant FALSE
## 7 Abbott, Jared FALSE
## 8 Abbott, Kathryn A. FALSE
## 9 Abbott, Lon D. FALSE
## 10 Abbott, Mark B. FALSE
## # ... with 22,535 more rows

You can also retrieve this one with:

readRDS(url("https://rud.is/dl/author-presenter.rds"))
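As a quick usage check, you can tally presenter status across all authors (count() is from dplyr):

# how many authors are / aren't flagged as presenting?
author_with_presenter_status <- readRDS(url("https://rud.is/dl/author-presenter.rds"))
count(author_with_presenter_status, is_presenting)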

This question, css - Looping through alphabetical pages (rvest), was originally asked on Stack Overflow: https://stackoverflow.com/questions/53468576/
