r - A faster alternative to file.exists()

Reposted by: 行者123. Updated: 2023-12-05 00:09:57


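I maintain an R package that needs to check for the existence of a large number of small files, one file at a time. Repeated calls to file.exists() cause a noticeable slowdown (benchmarking results here). Unfortunately, constraints prevent me from calling file.exists() once on the whole batch of files in a vectorized fashion, which I believe would be much faster. Is there a faster way to check whether a single file exists? Perhaps in C? The approach below does not seem to be any faster on my system (the same system that produced these benchmarks):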

library(inline)
library(microbenchmark)

body <- "
  FILE *fp = fopen(CHAR(STRING_ELT(r_path, 0)), \"r\");
  SEXP result = PROTECT(allocVector(INTSXP, 1));
  INTEGER(result)[0] = fp == NULL? 0 : 1;
  if (fp != NULL) fclose(fp);  /* close the handle so repeated calls do not leak it */
  UNPROTECT(1);
  return result;
"

file_exists_c <- cfunction(sig = signature(r_path = "character"), body = body)

tmp <- tempfile()

microbenchmark(
  c = file_exists_c(tmp),
  r = file.exists(tmp)
)
#> Unit: microseconds
#>  expr   min     lq    mean median     uq    max neval
#>     c 4.912 5.0230 5.42443 5.0605 5.1240 25.264   100
#>     r 3.972 4.0525 4.32615 4.1835 4.2675 11.750   100

file.create(tmp)
#> [1] TRUE

microbenchmark(
  c = file_exists_c(tmp),
  r = file.exists(tmp)
)
#> Unit: microseconds
#>  expr    min      lq     mean  median      uq    max neval
#>     c 16.212 16.6245 17.04727 16.7645 16.9860 32.207   100
#>     r  6.242  6.4175  7.16057  7.2830  7.4605 26.781   100

Created on 2019-12-06 by the reprex package (v0.3.0)

Edit: access() does seem to be faster, but not by much.

library(inline)
library(microbenchmark)

body <- "
  SEXP result = PROTECT(allocVector(INTSXP, 1));
  INTEGER(result)[0] = access(CHAR(STRING_ELT(r_path, 0)), 0)? 0 : 1;
  UNPROTECT(1);
  return result;
"

file_exists_c <- cfunction(
  sig = signature(r_path = "character"),
  body = body,
  includes = "#include <unistd.h>"
)

tmp <- tempfile()

microbenchmark(
  c = file_exists_c(tmp),
  r = file.exists(tmp)
)
#> Unit: microseconds
#>  expr   min    lq    mean median     uq    max neval
#>     c 1.033 1.048 1.21334 1.0745 1.0910 13.793   100
#>     r 1.051 1.068 1.19280 1.0930 1.1175 10.048   100

file.create(tmp)
#> [1] TRUE

microbenchmark(
  c = file_exists_c(tmp),
  r = file.exists(tmp)
)
#> Unit: microseconds
#>  expr   min     lq    mean median     uq    max neval
#>     c 1.073 1.0910 1.33543 1.1285 1.1500 16.676   100
#>     r 1.172 1.1965 1.32934 1.2335 1.2695  9.916   100

Created on 2019-12-07 by the reprex package (v0.3.0)
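
For comparison, base R already exposes an access()-style check through file.access(), so the same test can be tried without compiling any C. The following is only a minimal sketch: the helper name file_exists_access is made up for illustration, and it assumes that mode = 0 (test for existence) corresponds to the access(path, 0) call in the C snippet above. Whether it is any faster than file.exists() would still need to be benchmarked on the target system.

```r
# Hypothetical helper (not part of any package): existence check built on
# base R's file.access(). file.access() returns 0 on success and -1 on
# failure; mode = 0 tests for existence only.
file_exists_access <- function(path) {
  file.access(path, mode = 0) == 0
}

tmp <- tempfile()
before <- file_exists_access(tmp)  # the file has not been created yet
invisible(file.create(tmp))
after <- file_exists_access(tmp)   # the file now exists
unlink(tmp)
```

The helper can be dropped into the microbenchmark() calls above as a third expression to compare it against the other two.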

Best answer

Here is the full source code of file.exists (at the time of writing):

https://github.com/wch/r-source/blob/bfe73ecd848198cb9b68427cec7e70c40f96bd72/src/main/platform.c#L1375-L1404

SEXP attribute_hidden do_fileexists(SEXP call, SEXP op, SEXP args, SEXP rho)
{
    SEXP file, ans;
    int i, nfile;
    checkArity(op, args);
    if (!isString(file = CAR(args)))
        error(_("invalid '%s' argument"), "file");
    nfile = LENGTH(file);
    ans = PROTECT(allocVector(LGLSXP, nfile));
    for (i = 0; i < nfile; i++) {
        LOGICAL(ans)[i] = 0;
        if (STRING_ELT(file, i) != NA_STRING) {
#ifdef Win32
            /* Package XML sends arbitrarily long strings to file.exists! */
            size_t len = strlen(CHAR(STRING_ELT(file, i)));
            if (len > MAX_PATH)
                LOGICAL(ans)[i] = FALSE;
            else
                LOGICAL(ans)[i] =
                    R_WFileExists(filenameToWchar(STRING_ELT(file, i), TRUE));
#else
            // returns NULL if not translatable
            const char *p = translateCharFP2(STRING_ELT(file, i));
            LOGICAL(ans)[i] = p && R_FileExists(p);
#endif
        } else LOGICAL(ans)[i] = FALSE;
    }
    UNPROTECT(1); /* ans */
    return ans;
}

As for R_FileExists, it is defined here:

https://github.com/wch/r-source/blob/bfe73ecd848198cb9b68427cec7e70c40f96bd72/src/main/sysutils.c#L60-L79

#ifdef Win32
Rboolean R_FileExists(const char *path)
{
    struct _stati64 sb;
    return _stati64(R_ExpandFileName(path), &sb) == 0;
}
#else
Rboolean R_FileExists(const char *path)
{
    struct stat sb;
    return stat(R_ExpandFileName(path), &sb) == 0;
}
#endif

(R_ExpandFileName just does what path.expand does.) It relies on the stat system call:

https://en.wikipedia.org/wiki/Stat_(system_call)

https://pubs.opengroup.org/onlinepubs/007908799/xsh/sysstat.h.html

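It is built for vectorized input, so, as mentioned in the question, it is better to call file.exists(vector_of_files) once than to run file.exists(single_file) repeatedly.

As far as I can tell (admittedly, I am no expert on system utilities), any gain in efficiency would come at the cost of robustness.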
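
To illustrate the point about vectorized input, here is a minimal sketch. The scratch directory and file names are hypothetical, made up for the example; file.exists(), file.create() and vapply() are ordinary base R functions. Both forms return the same answers; the difference is that the vectorized call enters the C loop shown above once, while the per-file form pays the R-level call overhead for every path.

```r
# Hypothetical setup: a scratch directory with three files that exist
# and two paths that do not.
dir <- tempfile("exists_demo_")
dir.create(dir)
present <- file.path(dir, c("a.txt", "b.txt", "c.txt"))
missing <- file.path(dir, c("x.txt", "y.txt"))
invisible(file.create(present))
paths <- c(present, missing)

# One vectorized call: the loop over paths happens inside do_fileexists.
vectorized <- file.exists(paths)

# One call per file: same result, but file.exists() is invoked once per path.
one_by_one <- vapply(paths, file.exists, logical(1), USE.NAMES = FALSE)

print(vectorized)
unlink(dir, recursive = TRUE)
```

If the paths really do arrive one at a time, one possible workaround is to collect them first and defer the existence check until the whole batch is known, though whether that fits depends on the package's constraints.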

Regarding "r - A faster alternative to file.exists()", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59222945/
