
r - RCurl scraper based on concurrent requests

Reposted. Author: 行者123. Updated: 2023-12-03 12:43:59

Below is a script that reproduces the problem I face when building a web scraper with RCurl that performs concurrent requests.
The goal is to download the content of thousands of websites for statistical analysis, so the solution should scale.

library(RCurl)
library(httr)

uris = c("inforapido.com.ar", "lm.facebook.com", "promoswap.enterfactory.com",
"p.brilig.com", "wap.renxo.com", "alamaula.com", "syndication.exoclick.com",
"mcp-latam.zed.com", "startappexchange.com", "fonts.googleapis.com",
"xnxx.com", "wv.inner-active.mobi", "canchallena.lanacion.com.ar",
"android.ole.com.ar", "livefyre.com", "fbapp://256002347743983/thread")

### RCurl Concurrent requests

getURIs <- function(uris, ..., multiHandle = getCurlMultiHandle(), .perform = TRUE) {
  content = list()
  curls = list()
  for(i in uris) {
    curl = getCurlHandle()
    content[[i]] = basicTextGatherer()
    opts = curlOptions(URL = i, writefunction = content[[i]]$update,
                       timeout = 2, maxredirs = 3, verbose = TRUE,
                       followLocation = TRUE, ...)
    curlSetOpt(.opts = opts, curl = curl)
    multiHandle = push(multiHandle, curl)
  }
  if(.perform) {
    complete(multiHandle)
    lapply(content, function(x) x$value())
  } else {
    return(list(multiHandle = multiHandle, content = content))
  }
}

### Split uris in 3
uris_ls = split(uris, 1:3)

### retrieve content
uris_content <- list()
for(i in seq_along(uris_ls)) {
  uris_content[[i]] <- getURIs(uris_ls[[i]])
}

library(plyr)
a = lapply(uris_content, function(x) ldply(x, rbind))
result = ldply(a, rbind)
names(result) <- c('url', 'content')
result$number_char <- nchar(as.character(result$content))

### Here are examples of url that aren't working
url_not_working = result[result$number_char == 0, 1]

# url_not_working
# [1] "inforapido.com.ar" "canchallena.lanacion.com.ar" "fbapp://256002347743983/thread"
# [4] "xnxx.com" "startappexchange.com" "wv.inner-active.mobi"
# [7] "livefyre.com"

### Using httr GET it works fine

get_httr = GET(url_not_working[2])
content(get_httr, 'text')

# The result is the same when using a single call
get_rcurl = getURL(url_not_working[2], encoding='UTF-8', timeout = 2,
maxredirs = 3, verbose = TRUE,
followLocation = TRUE)
get_rcurl

Question:

Given the number of web pages I need to scrape, I would rather use RCurl for this, since it supports concurrent requests.
I wonder whether the getURIs() call can be improved so that it works like the GET() version in the cases where the getURL/getURIs version fails.

Update:

I added a gist with more data (990 URIs) to reproduce the problem better.
uris_ls <- dput() # dput() output found here: https://gist.github.com/martinbel/b4cc730b32914475ef0b

After running:
uris_content <- list()
for(i in seq_along(uris_ls)) {
  uris_content[[i]] <- getURIs(uris_ls[[i]])
}

I get the following error:
Error in curlMultiPerform(obj) : embedded nul in string: 'GIF89a\001'
In addition: Warning message:
In strsplit(str, "\\\r\\\n") : input string 1 is invalid in this locale
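The "embedded nul" error suggests that one of the URIs redirected to binary content ("GIF89a" is the header of a GIF file), which basicTextGatherer cannot store as an R string. A minimal sketch of the failure mode, using a hand-made raw vector rather than a real download, and a workaround that strips nul bytes before conversion:

```r
# The first bytes of a GIF response, as RCurl would receive them;
# note the embedded nul (0x00) that breaks string conversion:
raw_body <- as.raw(c(0x47, 0x49, 0x46, 0x38, 0x39, 0x61, 0x00, 0x01))

# rawToChar(raw_body) would fail with "embedded nul in string";
# dropping the nul bytes first makes the conversion safe:
clean <- raw_body[raw_body != as.raw(0)]
rawToChar(clean)  # "GIF89a\001"
```

In practice this means fetching with a function that returns raw vectors for binary responses (e.g. getBinaryURL or getURLContent in RCurl) and converting to text only for pages that are actually text.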

Using getURIAsynchronous:

uris_content <- list()
for(i in seq_along(uris_ls)) {
  uris_content[[i]] <- getURIAsynchronous(uris_ls[[i]],
    .opts = list(timeout = 2, maxredirs = 3, verbose = TRUE,
                 followLocation = TRUE))
}

I get a similar error:
Error in nchar(str) : invalid multibyte string 1
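The "invalid multibyte string" error comes from calling nchar() on bytes that are not valid in the current locale. A small sketch (a workaround of my own, not from the original post): iconv() with sub = "" drops the offending bytes so nchar() succeeds:

```r
# A string containing a byte (0xFF) that is invalid UTF-8:
bad <- rawToChar(as.raw(c(0x61, 0xff, 0x62)))   # "a", invalid byte, "b"

# nchar(bad) errors in a UTF-8 locale; sanitizing first avoids that:
good <- iconv(bad, from = "UTF-8", to = "UTF-8", sub = "")
nchar(good)
```

Setting the locale to "C" (as in Update 2 below) sidesteps the error a different way, by making nchar() count bytes instead of multibyte characters.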

Update 2

library(RCurl)
uris_ls <- dput() # dput() output found here: https://gist.github.com/martinbel/b4cc730b32914475ef0b

After trying the following:

Sys.setlocale(locale="C")
uris_content <- list()
for(i in seq_along(uris_ls)) {
  uris_content[[i]] <- getURIAsynchronous(uris_ls[[i]],
    .opts = list(timeout = 2, maxredirs = 3, verbose = TRUE,
                 followLocation = TRUE))
}

The result is that it runs fine for the first 225 URLs, after which it returns only zero-length content from the sites. Is this the embedded-nul problem?
# This is a quick way to inspect the output:
nc = lapply(uris_content, nchar)
nc[[5]]
[1] 51422 0 16 19165 111763 6 14041 202 2485 0
[11] 78538 0 0 0 133253 42978 0 0 7880 33336
[21] 6762 194 93 0 0 0 0 0 9 0
[31] 165974 13222 22605 1392 0 42932 1421 0 0 0
[41] 0 13760 289 0 2674

nc[[6]]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[39] 0 0 0 0 0 0 0

Best answer

Since nobody answered, I came up with a temporary solution. If getURIAsynchronous doesn't work, just download sequentially using httr::GET and httr::content, which don't have the embedded-nul problem.

library(RCurl)
library(httr)

Sys.setlocale(locale="C")

opts = list(timeout = 2, maxredirs = 3,
            verbose = TRUE, followLocation = TRUE)

try_asynch <- function(uris, .opts = opts) {
  getURIAsynchronous(uris, .opts = .opts)
}

get_content <- function(uris) {
  cont <- try_asynch(uris)
  # Count how many URIs returned non-empty content
  nc <- lapply(cont, nchar)
  nc <- sapply(nc, function(x) ifelse(sum(x > 0), 1, 0))
  # Fall back to sequential httr::GET when most results came back empty
  if(sum(nc) < 10) {
    r <- lapply(uris, function(x) GET(x))
    cont <- lapply(r, function(x) content(x, 'text'))
  }
  cont
}

docs <- lapply(uris_ls, get_content)
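One further hardening step (an assumption of mine, not part of the original answer): wrap each GET in tryCatch so that a single unreachable URI, such as the fbapp:// entry above, returns NA instead of aborting the whole batch.

```r
library(httr)

# Hypothetical helper: fetch one URI as text, returning NA on any error
# (timeouts, DNS failures, unsupported schemes like fbapp://):
safe_get <- function(u) {
  tryCatch(content(GET(u, timeout(2)), 'text'),
           error = function(e) NA_character_)
}

# docs <- lapply(uris, safe_get)
```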

Regarding "r - RCurl scraper based on concurrent requests", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/26090304/
