
r - Faster way to download multiple files in R


I wrote a small downloader in R so I can download some log files from a remote server in one go:

file_remote <- fun_to_list_URLs()
file_local <- fun_to_gen_local_paths()
credentials <- "usr/pwd"

downloader <- function(file_remote, file_local, credentials) {
  data_bin <- RCurl::getBinaryURL(
    file_remote,
    userpwd = credentials,
    ftp.use.epsv = FALSE,
    forbid.reuse = TRUE
  )

  writeBin(data_bin, file_local)
}

purrr::walk2(
  file_remote,
  file_local,
  ~ downloader(
    file_remote = .x,
    file_local = .y,
    credentials = credentials
  )
)

This works, but it is slow, especially compared to an FTP client such as WinSCP: downloading 64 log files of 2 kb each takes several minutes.

Is there a faster way to download a large number of files in R?

Best answer

The curl package has a way of performing asynchronous requests, which means downloads are performed simultaneously rather than one after another. Especially with smaller files this should give you a large performance boost. Here is a bare-bones function that does that:

# total_con: max total concurrent connections.
# host_con: max concurrent connections per host.
# print: print status of requests at the end.
multi_download <- function(file_remote,
                           file_local,
                           total_con = 1000L,
                           host_con = 1000L,
                           print = TRUE) {

  # check for duplication (deactivated for testing)
  # dups <- duplicated(file_remote) | duplicated(file_local)
  # file_remote <- file_remote[!dups]
  # file_local <- file_local[!dups]

  # create pool
  pool <- curl::new_pool(total_con = total_con,
                         host_con = host_con)

  # function performed on successful request
  save_download <- function(req) {
    writeBin(req$content, file_local[file_remote == req$url])
  }

  # setup async calls
  invisible(
    lapply(
      file_remote, function(f)
        curl::curl_fetch_multi(f, done = save_download, pool = pool)
    )
  )

  # all created requests are performed here
  out <- curl::multi_run(pool = pool)

  if (print) print(out)

}
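
Your original code also authenticates against the FTP server, which the bare-bones function above does not do. As a minimal, untested sketch (assuming the libcurl option names that curl::new_handle() accepts, such as userpwd), each request could carry its own handle with the credentials while still sharing the pool, since curl_fetch_multi() takes a handle argument:

# hypothetical adaptation for authenticated downloads: every request gets a
# handle that carries the credentials (see curl::curl_options() for the
# available option names)
multi_download_auth <- function(file_remote, file_local, credentials,
                                total_con = 1000L, host_con = 1000L) {
  pool <- curl::new_pool(total_con = total_con, host_con = host_con)

  save_download <- function(req) {
    writeBin(req$content, file_local[file_remote == req$url])
  }

  invisible(
    lapply(file_remote, function(f) {
      h <- curl::new_handle(userpwd = credentials)  # e.g. "usr/pwd" as in your code
      # further libcurl options (such as disabling EPSV) can be set on h as well
      curl::curl_fetch_multi(f, done = save_download, pool = pool, handle = h)
    })
  )

  curl::multi_run(pool = pool)
}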

Now we need some test files to compare it with your baseline approach. I use the COVID data from the Johns Hopkins University GitHub page, since it consists of many small csv files that should be similar to your files.

file_remote <- paste0(
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/",
  format(seq(as.Date("2020-03-03"), as.Date("2022-06-01"), by = "day"), "%d-%m-%Y"),
  ".csv"
)
file_local <- paste0("/home/johannes/Downloads/test/", seq_along(file_remote), ".bin")

We could also infer the file names from the URLs (see the short aside below), but I assume that is not what you want. After the aside, let's compare the approaches for these 821 files.
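
Purely as an aside, such a mapping could be as simple as taking basename() of each URL (using the same hypothetical target directory as above):

# aside only: derive local file names from the URLs instead of numbering them
# (the benchmark below keeps the numbered names)
file_local <- file.path("/home/johannes/Downloads/test", basename(file_remote))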

res <- bench::mark(
  baseline(),
  multi_download(file_remote,
                 file_local,
                 print = FALSE),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
summary(res)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                                                 min median `itr/sec`
#>   <bch:expr>                                              <bch:> <bch:>     <dbl>
#> 1 baseline()                                                2.8m   2.8m   0.00595
#> 2 multi_download(file_remote, file_local, print = FALSE)   12.7s  12.7s    0.0789
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>
summary(res, relative = TRUE)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                                                min median `itr/sec`
#>   <bch:expr>                                              <dbl>  <dbl>     <dbl>
#> 1 baseline()                                               13.3   13.3       1
#> 2 multi_download(file_remote, file_local, print = FALSE)    1      1        13.3
#> # … with 2 more variables: mem_alloc <dbl>, `gc/sec` <dbl>

The new approach is 13.3 times faster than the original one, and I would expect the difference to grow with the number of files. Note, though, that this benchmark is not perfect, as my internet speed fluctuates quite a bit.

The function should also be improved in terms of error handling (currently you get a message saying how many requests succeeded and how many errored, but no indication of which files are affected; a small sketch of one possible improvement follows). My understanding is also that multi_run keeps the files in memory until save_download writes them to disk. With small files this is fine, but it could become an issue with larger ones.
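
As a sketch only (reusing save_download and pool from inside multi_download, and the fail callback that curl::curl_fetch_multi() accepts), the "setup async calls" step could additionally record which URLs failed:

# sketch: record transport-level failures per URL while setting up the requests
failures <- list()

invisible(
  lapply(
    file_remote, function(f)
      curl::curl_fetch_multi(
        f,
        done = save_download,
        fail = function(msg) failures[[f]] <<- msg,  # error message per URL
        pool = pool
      )
  )
)

out <- curl::multi_run(pool = pool)
names(failures)  # URLs that could not be downloaded
# note: fail fires when a request cannot be completed (e.g. connection errors);
# responses that do come back with an HTTP error code still go to done.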

Baseline function

baseline <- function() {
  credentials <- "usr/pwd"
  downloader <- function(file_remote, file_local, credentials) {
    data_bin <- RCurl::getBinaryURL(
      file_remote,
      userpwd = credentials,
      ftp.use.epsv = FALSE,
      forbid.reuse = TRUE
    )
    writeBin(data_bin, file_local)
  }

  purrr::walk2(
    file_remote,
    file_local,
    ~ downloader(
      file_remote = .x,
      file_local = .y,
      credentials = credentials
    )
  )
}

Created on 2022-06-05 by the reprex package (v2.0.1)

On "r - Faster way to download multiple files in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/72380712/
