
Reading big data with messy strings and multiple string indicators in R

Reposted. Author: 行者123. Updated: 2023-12-02 02:46:28

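I have a large (8GB+) comma-separated csv file that I want to read into R. The file contains three columns:

  • date # format 2017-12-27
  • text # a string
  • type # a label for each string (NA, typeA, or typeB)

The problem is that the text column contains various string indicators: ' (single quotation marks), " (double quotation marks), no quotation marks at all, as well as multiple separated strings.

For example: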

date        text                        type
2016-01-01 great job! NA
2016-01-02 please, type "submit" typeA
2016-01-02 "can't see the "error" now" typeA
2016-01-03 "add \\"/filename.txt\\"" NA

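To read this data, I tried:

  • base read.csv and readr's read_csv: they partly work, but either fail (probably due to memory) or take ages to read
  • splitting the data into batches of 1 million lines via the Mac terminal: fails because lines appear to be broken arbitrarily
  • fread (preferred, as I hoped it would solve the other two problems): fails with the error "Expecting 3 cols, but line 1103 contains text after processing all cols."

My idea is to work around these issues by using what I know about the data, i.e. every line starts with a date and ends with NA, typeA, or typeB.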
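Since splitting by physical lines reportedly breaks rows, one way to use that knowledge is to stitch physical lines back into logical records before parsing. The following is only a minimal sketch, assuming that a new record always begins with an optionally quoted YYYY-MM-DD date followed by a comma; stitch_records is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: merge physical lines into logical records.
# Assumption: a record starts with an (optionally quoted) ISO date and a comma.
stitch_records <- function(lines) {
  starts <- grepl('^"?[0-9]{4}-[0-9]{2}-[0-9]{2}"?,', lines)
  id <- cumsum(starts)   # index of the record each physical line belongs to
  keep <- id > 0         # drop lines before the first dated line (the header)
  pieces <- split(lines[keep], id[keep])
  unname(vapply(pieces, paste, character(1), collapse = " "))
}

lines <- c('"date","text","type"',
           '"2016-04-01","first part of a message',
           'second part of the same message",NA',
           '"2016-04-02","complete message","typeA"')
records <- stitch_records(lines)
print(records)
```

A continuation line that happens to start with a date and a comma would be mistaken for a new record, and when reading in chunks the last record of a chunk may be incomplete and would need to be carried over to the next chunk.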

How could I implement this (using plain readLines or fread)?

Edit: sample data (anonymized), as opened in TextWrangler on a Mac:

"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success @myid",typeA

Sample data 2:

"date","text","type"
"2018-05-02","i try this, but it doesnt work",NA
"2018-05-02","Thank you very much. Cheers !!",NA
"2018-05-02","@myid. I'll change this.",NA

Sample data that reproduces the fread error "Expecting 3 cols, but line 3 contains text after processing all cols.":

"date","text","type"
"2015-03-02","Some text, some text, some question? Please, some question?",NA
"2015-03-02","Here you have the error ""Can’t access {file \""Macintosh HD:abc:def:filename\"", \""/abc.txt\""} from directory."" something -1100 from {file ""Macintosh HD:abc:def:filename"", ""/abc.txt""} to file",NA
"2015-03-02","good idea",NA
"2015-03-02","Worked perfectly :)",NA

Session info:

R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.10.4-3 readr_1.1.1

loaded via a namespace (and not attached):
[1] compiler_3.5.0 assertthat_0.2.0 R6_2.2.2 cli_1.0.0
[5] hms_0.4.2 tools_3.5.0 pillar_1.2.2 rstudioapi_0.7
[9] tibble_1.4.2 yaml_2.1.19 crayon_1.3.4 Rcpp_0.12.16
[13] utf8_1.1.3 pkgconfig_2.0.1 rlang_0.2.0

Best answer

A readLines approach could be:

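infile <- file("test.txt", "r")
txt <- readLines(infile, n = 1)   # read (and discard) the header line
df <- NULL

# change this value as per your requirement
chunksize <- 1

while (length(txt)) {
  txt <- readLines(infile, warn = FALSE, n = chunksize)
  # date: everything before the first whitespace
  # text: everything between the first and the last field
  # type: everything after the last whitespace
  df <- rbind(df, data.frame(date = gsub("\\s.*", "", txt),
                             text = trimws(gsub("\\S+(.*)\\s+\\S+$", "\\1", txt)),
                             type = gsub(".*\\s", "", txt),
                             stringsAsFactors = FALSE))
}
close(infile)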

This gives:

> df
date text type
1 2016-01-01 great job! NA
2 2016-01-02 please, type "submit" typeA
3 2016-01-02 "can't see the "error" now" typeA
4 2016-01-03 "add \\\\"/filename.txt\\\\"" NA

Sample data: test.txt contains

date        text                        type
2016-01-01 great job! NA
2016-01-02 please, type "submit" typeA
2016-01-02 "can't see the "error" now" typeA
2016-01-03 "add \\"/filename.txt\\"" NA
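On a file of several gigabytes, calling rbind inside the loop copies the growing data frame on every iteration, and a chunk size of 1 means one iteration per line. A possible variant, shown here only as a sketch, collects the parsed chunks in a list and binds them once at the end. The temporary file, the helper name parse_chunk, and the chunk size are assumptions made for the demonstration; the regular expressions are the ones from the answer.

```r
# Demo file holding part of the whitespace-separated sample (temporary path).
path <- tempfile(fileext = ".txt")
writeLines(c("date        text                        type",
             "2016-01-01 great job! NA",
             "2016-01-02 please, type \"submit\" typeA",
             "2016-01-02 \"can't see the \"error\" now\" typeA"), path)

# Hypothetical helper: apply the answer's three regular expressions to a chunk.
parse_chunk <- function(txt) {
  data.frame(date = gsub("\\s.*", "", txt),
             text = trimws(gsub("\\S+(.*)\\s+\\S+$", "\\1", txt)),
             type = gsub(".*\\s", "", txt),
             stringsAsFactors = FALSE)
}

con <- file(path, "r")
invisible(readLines(con, n = 1))   # skip the header line
chunks <- list()
chunksize <- 2                     # tiny for the demo; larger (e.g. 1e5) on real data
repeat {
  txt <- readLines(con, n = chunksize, warn = FALSE)
  if (length(txt) == 0) break      # end of file reached
  chunks[[length(chunks) + 1]] <- parse_chunk(txt)
}
close(con)

df <- do.call(rbind, chunks)       # bind once, after the loop
print(df)
```

Note that the type column holds the string "NA" here rather than a missing value, exactly as in the original code.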


Update: to parse the other set of sample data, you can modify the code above to use the regex parser below:

df <- rbind(df, data.frame(date = gsub("\"(\\S{10}).*", "\\1", txt),
                           text = gsub(".*\"\\,\"(.*)\"\\,(\"|NA).*", "\\1", txt),
                           type = gsub(".*\\,|\"", "", txt),
                           stringsAsFactors = FALSE))

The other set of sample data:

"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success @myid","typeA"

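The text pattern in the update expects the type field to be either quoted or NA, so a record ending in an unquoted typeA (as in the question's own sample) would not be split. A more tolerant variant, given here only as a sketch, anchors one pattern at both ends of the record. It assumes the date is YYYY-MM-DD and the type is NA, typeA, or typeB, each optionally quoted; split_record is a hypothetical helper name.

```r
# Hypothetical helper: split CSV-style records using the known start and end.
split_record <- function(txt) {
  pattern <- '^"?([0-9]{4}-[0-9]{2}-[0-9]{2})"?,(.*),"?(NA|typeA|typeB)"?$'
  ok <- grepl(pattern, txt)

  date <- sub(pattern, "\\1", txt)
  text <- sub(pattern, "\\2", txt)
  type <- sub(pattern, "\\3", txt)

  text <- gsub('^"|"$', "", text)               # drop the enclosing quotes
  text <- gsub('""', '"', text, fixed = TRUE)   # CSV escapes a quote by doubling it

  date[!ok] <- NA                               # records that do not fit the pattern
  text[!ok] <- NA
  type[!ok | type == "NA"] <- NA                # turn the literal NA into a missing value

  data.frame(date = date, text = text, type = type, stringsAsFactors = FALSE)
}

txt <- c('"2016-04-01","Fiex this now. Please check.","typeA"',
         '"2014-05-02","Tried to remove ""../"" but no success @myid",typeA',
         '"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA')
res <- split_record(txt)
print(res)
```

Because the middle group is greedy and the pattern is anchored at the end, commas inside the text stay in the text column. Backslash-escaped quotes, as in the sample that triggers the fread error, are left untouched by this sketch.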
This question about reading big data with messy strings and multiple string indicators in R comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/50851871/
