
r - fread(): reading table with \r\r\n as newline symbol

Reposted. Author: 行者123. Updated: 2023-12-01 21:15:59

I have tab-delimited tables in text files in which every line ends with \r\r\n (0x0D 0x0D 0x0A). If I try to read such a file with fread(), it reports:

Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.

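But I did not download these files; I already have them.

So far, the workaround I have found is to read the file with read.table() first (it treats the \r\r\n combination as a single end-of-line marker) and then convert the resulting data.frame with data.table():

mydt <- data.table(read.table(myfilename, header = TRUE, sep = '\t', fill = TRUE))

But I would like to know whether there is a way to avoid the slow read.table() and use the fast fread() instead.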

Best answer

I suggest using the GNU utility tr to strip the unnecessary \r characters. For example:

library(data.table)

cat("a,b,c\r\r\n1, 2, 3\r\r\n4, 5, 6", file = "test.csv")
fread("test.csv")
## Error in fread("test.csv") :
## Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.

system("tr -d '\r' < test.csv > test2.csv")
fread("test2.csv")
## a b c
## 1: 1 2 3
## 2: 4 5 6

If you are on Windows and do not have the tr utility, you can get it here.
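If installing tr is not an option, the same byte-level cleanup can be sketched in base R. This is an illustration rather than part of the original answer: strip_cr is a hypothetical helper name, and the demo file is a made-up temporary file. Like tr -d '\r', it drops every 0x0D byte, including any that happen to sit inside quoted fields.

```r
# Hypothetical helper: copy a file while dropping every carriage return (0x0D).
strip_cr <- function(infile, outfile) {
  # Read the whole file as raw bytes so no newline translation takes place
  bytes <- readBin(infile, what = "raw", n = file.size(infile))
  # Keep everything except \r and write the result back in binary mode
  writeBin(bytes[bytes != as.raw(0x0D)], outfile)
  invisible(outfile)
}

# Demo on a small temporary file whose lines end in \r\r\n
infile  <- tempfile(fileext = ".csv")
outfile <- tempfile(fileext = ".csv")
writeBin(charToRaw("a,b,c\r\r\n1,2,3\r\r\n4,5,6\r\r\n"), infile)
strip_cr(infile, outfile)

cleaned <- readBin(outfile, what = "raw", n = file.size(outfile))
print(rawToChar(cleaned))
```

The cleaned file can then be passed to fread(outfile) as usual. Because the whole file is held in memory at once, this sketch suits files that fit comfortably in RAM.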

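Added:

I compared three methods on a 100,000 x 5 sample CSV dataset.

  • OPcsv is the "slow" read.table method
  • freadScan is a method that discards the extra \r characters in pure R
  • freadtr calls GNU tr through the shell directly from fread()

The third method is the fastest; in the benchmark at the end, the read.table approach takes about 1.4 times as long (median) and the pure-R approach about 1.7 times as long.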
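For a comparison like this, the test file has to contain the \r\r\n bytes exactly as intended. As an alternative way to build such a file, here is a minimal vectorized sketch in base R. The name make_sample and its arguments are hypothetical, not from the original answer. It writes with writeBin because text-mode output on Windows may translate \n and so change the line endings.

```r
# Hypothetical helper: write n rows x 5 columns of random integers (0 to 100)
# to `path`, terminating every line (header included) with `delim`.
make_sample <- function(path, n = 100000, delim = "\r\r\n") {
  vals <- matrix(round(runif(n * 5) * 100), ncol = 5)
  # Paste the five columns together row by row, comma-separated
  rows <- do.call(paste, c(as.data.frame(vals), sep = ","))
  # Append the delimiter to each line and join everything into one string
  txt <- paste0(c("a,b,c,d,e", rows), delim, collapse = "")
  # Binary write, so the bytes land on disk unchanged
  writeBin(charToRaw(txt), path)
  invisible(path)
}

# Small demo: 10 rows written to a temporary file
demo_path <- tempfile(fileext = ".csv")
make_sample(demo_path, n = 10)
demo_bytes <- readBin(demo_path, what = "raw", n = file.size(demo_path))
```

Building all rows in one paste call avoids growing a string inside a loop, which gets slower as the string gets longer.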

# create a 100,000 x 5 sample dataset with lines ending in \r\r\n
delim <- "\r\r\n"
sample.txt <- paste0("a, b, c, d, e", delim)
for (i in 1:100000) {
  sample.txt <- paste0(sample.txt,
                       paste(round(runif(5) * 100), collapse = ","),
                       delim)
}
cat(sample.txt, file = "sample.csv")


# function that removes the extra \r characters using R only
fread2 <- function(filename) {
  tmp <- scan(file = filename, what = "character", quiet = TRUE)
  # remove empty lines caused by \r
  tmp <- tmp[tmp != ""]
  # paste the lines back together with the \n character
  tmp <- paste(tmp, collapse = "\n")
  fread(tmp)
}

# OP's method using read.table, which is slow
readcsvMethod <- function(myfilename) {
  data.table(read.table(myfilename, header = TRUE, sep = ',', fill = TRUE))
}

library(microbenchmark)
microbenchmark(OPcsv = readcsvMethod("sample.csv"),
               freadScan = fread2("sample.csv"),
               freadtr = fread("tr -d \'\\r\' < sample.csv"),
               unit = "relative")
## Unit: relative
##       expr      min       lq     mean   median       uq      max neval
##      OPcsv 1.331462 1.336524 1.340037 1.365397 1.366041 1.249223   100
##  freadScan 1.532169 1.581195 1.624354 1.673691 1.676596 1.355434   100
##    freadtr 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
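In more recent data.table releases, a shell command can be passed to fread() through the cmd argument instead of as the input string. The following is a sketch of the freadtr approach written that way. It assumes data.table is installed and tr is on the PATH, and the demo file is a made-up temporary file.

```r
library(data.table)

# Hypothetical demo file with \r\r\n line endings, written as raw bytes
infile <- tempfile(fileext = ".csv")
writeBin(charToRaw("a,b,c\r\r\n1,2,3\r\r\n4,5,6\r\r\n"), infile)

# Let the shell strip \r before fread parses the stream;
# shQuote() protects the file path from the shell
dt <- fread(cmd = paste("tr -d '\\r' <", shQuote(infile)))
print(dt)
```

Naming the argument makes it explicit that the string is a command rather than a file name or literal data.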

Regarding r - fread(): reading table with \r\r\n as newline symbol, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33339656/
