r - Handling crawl errors such as 404 with tryCatch and rvest

Reposted by: 行者123  Updated: 2023-12-01 18:35:13

When using rvest to retrieve h1 headings, I sometimes hit a 404 page. This stops the process and returns this error.

Error in open.connection(x, "rb") : HTTP error 404.

See the example below

Data<-data.frame(Pages=c(
"http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
"http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facdddebook.html"))

Code used to retrieve the h1

library(rvest)
sapply(Data$Pages, function(url){
 url %>%
 as.character() %>% 
 read_html() %>% 
 html_nodes('h1') %>% 
 html_text()
 })

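Is there a way to include an argument that ignores errors and continues the process?

Best answer

You are looking for try or tryCatch, which are how R handles error catching.

With try, you just wrap whatever might fail in try(), and it returns the error and keeps running: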

library(rvest)

sapply(Data$Pages, function(url){
  try(
    url %>%
      as.character() %>% 
      read_html() %>% 
      html_nodes('h1') %>% 
      html_text()
  )
})

# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"                                         
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"                               
# [4] "Error in open.connection(x, \"rb\") : HTTP error 404.\n"
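
One detail about the output above: sapply simplifies the results to a character vector, so the failure shows up as plain text. If the results are kept in a list instead, a failed try() call is an object of class "try-error" and can be detected with inherits(). A minimal sketch, assuming a hypothetical fetch_title() that stands in for the rvest pipeline and fails for one made-up URL so that it runs without network access:

```r
# Hypothetical stand-in for the rvest pipeline (not from the original post):
# it fails for any URL containing "missing", mimicking a 404.
fetch_title <- function(url) {
  if (grepl("missing", url)) stop("HTTP error 404.")
  paste("Title of", url)
}

# Made-up URLs used only for illustration
urls <- c("http://example.com/a", "http://example.com/missing")

# lapply keeps each result as-is, so the "try-error" class survives
results <- lapply(urls, function(url) try(fetch_title(url), silent = TRUE))

# TRUE where the call failed
failed <- vapply(results, inherits, logical(1), what = "try-error")

# Keep the successful titles and leave NA where the call failed
titles <- rep(NA_character_, length(urls))
titles[!failed] <- unlist(results[!failed])
```

Here silent = TRUE only suppresses printing of the error message; the error object is still returned.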

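Still, while this retrieves everything, try by itself leaves the error text in the results as bad data. tryCatch lets you configure what happens when an error is raised by passing a function to its error argument; that function runs whenever an error occurs: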

sapply(Data$Pages, function(url){
  tryCatch(
    url %>%
      as.character() %>% 
      read_html() %>% 
      html_nodes('h1') %>% 
      html_text(), 
    error = function(e){NA}    # a function that returns NA regardless of what it's passed
  )
})

# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"                                         
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"                               
# [4] NA

There we go; much better.
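
The error handler receives the condition object, so it can also report which URL failed before returning NA. The sketch below is an illustration rather than part of the original answer: fetch_title() and the example.com URLs are hypothetical stand-ins for the rvest pipeline, and vapply() is used in place of sapply() so that the result is guaranteed to be a character vector.

```r
# Hypothetical stand-in for the rvest pipeline (not from the original post):
# it fails for any URL containing "missing", mimicking a 404.
fetch_title <- function(url) {
  if (grepl("missing", url)) stop("HTTP error 404.")
  paste("Title of", url)
}

# Wrap the risky call once so the error handling lives in one place
safe_title <- function(url) {
  tryCatch(
    fetch_title(url),
    error = function(e) {
      # conditionMessage() extracts the text of the error condition
      message("Skipping ", url, ": ", conditionMessage(e))
      NA_character_   # typed NA keeps the result a character vector
    }
  )
}

# Made-up URLs used only for illustration
urls <- c("http://example.com/a", "http://example.com/missing")

# vapply() errors out if any element is not a single string
titles <- vapply(urls, safe_title, character(1), USE.NAMES = FALSE)
```

Returning NA_character_ rather than a bare NA is a small design choice: it satisfies the character(1) template that vapply() checks against.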

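Update

In the tidyverse, the purrr package offers two functions, safely and possibly, which work like try and tryCatch. They are adverbs rather than verbs, meaning that they take a function, modify it to handle errors, and return a new function (not a data object) that can then be called. Example: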

library(tidyverse)
library(rvest)

df <- Data %>% rowwise() %>%     # Evaluate each row (URL) separately
    mutate(Pages = as.character(Pages),    # Convert factors to character for read_html
           title = possibly(~.x %>% read_html() %>%    # Try to take a URL, read it,
                                html_nodes('h1') %>%    # select header nodes,
                                html_text(),    # and collect text inside.
                            NA)(Pages))    # If error, return NA. Call modified function on URLs.

df %>% select(title)
## Source: local data frame [4 x 1]
## Groups: <by row>
## 
## # A tibble: 4 × 1
##                                                                                        title
##                                                                                        <chr>
## 1 'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages
## 2                                          OMG, this Japanese Trump Commercial is everything
## 3                                Omar Mateen posted to Facebook during Orlando mass shooting
## 4                                                                                       <NA>
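
Where possibly discards the error, safely keeps it: the wrapped function returns a list with a result element and an error element, one of which is NULL. A small offline sketch, assuming a hypothetical fetch_title() stand-in and made-up URLs rather than real pages:

```r
library(purrr)

# Hypothetical stand-in for the rvest pipeline (not from the original post):
# it fails for any URL containing "missing", mimicking a 404.
fetch_title <- function(url) {
  if (grepl("missing", url)) stop("HTTP error 404.")
  paste("Title of", url)
}

# Made-up URLs used only for illustration
urls <- c("http://example.com/a", "http://example.com/missing")

# possibly(): replace any error with a default value
possibly_title <- possibly(fetch_title, otherwise = NA_character_)
titles <- map_chr(urls, possibly_title)

# safely(): keep both the result and the error for later inspection
safely_title <- safely(fetch_title)
outcomes <- map(urls, safely_title)

# Pull out the error message of the failed call
failure_message <- conditionMessage(outcomes[[2]]$error)
```

safely is the better fit when the failures need to be logged or retried later; possibly is enough when a placeholder value will do.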

Regarding r - handling crawl errors such as 404 with tryCatch and rvest, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38114066/
