
r - Using fread to read data with double quotes and incorrectly escaped characters

Reposted. Author: 行者123. Updated: 2023-12-01 22:59:34

I am trying to load a large data file (about 20 million rows) using fread() from the data.table package. However, some rows are causing a lot of trouble.

Minimal example:

text.csv contains:

id, text
1,"""Oops"",\""The"",""Georgia"""

fread("text.csv", sep=",")

Error in fread("text.csv", sep = ",") :
Not positioned correctly after testing format of header row. ch=','
In addition: Warning message:
In fread("text.csv", sep = ",") :
Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: id, text

read.table() works slightly better, but it is too slow and too memory-inefficient.

> read.table("text.csv", header = TRUE, sep=",")
id text
1 1 "Oops",\\"The","Georgia"

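I realize that my text file is badly formatted, but it is too large to edit by hand.

Any help is much appreciated.

Edit:

A small sample of actual data records: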

sample1.txt, a good record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music

> fread("sample1.txt", sep=",")
materiale_id dk5 description creator subject-phrase
1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin NA
title type
1: Koncert i Copenhagen Jazz House den 26.1.1995 music


sample2.txt, a good and a bad record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music
150012-leksikon:100019,,"Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...",,"[""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]",it - elementer i databehandling,article

> fread("sample2.txt", sep=",")
Empty data.table (0 rows) of 11 cols: 150012-leksikon:100019,V2,Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...,V4,[""Informatik","it"...

Edit 2:

Updated to R version 3.2.3 and data.table 1.9.6. This helped with the above, but causes problems with other records:

sample3.txt, a good and a bad record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000236595,,,Red Tampa Solist prf,"[""Tom"",""Georgia"",""1929-1930""]","Georgia Tom, 1929-1930",music
125030-katalog:000236596,,,Jane Lucas (Solist),"[""1928-1931"",""Tom,\""The"",""Georgia"",""Accompanist""]","Georgia Tom,""The Accompanist"" (1928-1931)",music

> s3 <- fread("sample3.txt", sep=",")
Error in fread("sample3.txt", sep = ",") :
Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.

Edit 3:

Updating to the development version 1.9.7 of data.table breaks fread() completely:

> s3 <- fread("sample3.txt", sep=",")
Error in fread("sample3.txt", sep = ",") :
showProgress is not type integer but type 'logical'. Please report.

Edit 4:

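The problem seems to occur in my file when a record contains the string \\" (literal, not a regex). Apparently there are too many backslashes, which causes fread() to misinterpret the double quote as the end of the string, when it should be read as a literal character.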
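To check which records contain a backslash directly followed by a double quote, a small sketch like the following could be used. It runs on the lines of sample3.txt, typed in here as raw strings (this assumes R 4.0 or later; on the real file one would presumably use readLines() instead). The variable names are illustrative only.

```r
# Lines of sample3.txt, written as raw strings (R >= 4.0) so that quotes and
# backslashes appear exactly as they do in the file
lines <- c(
  "materiale_id,dk5,description,creator,subject-phrase,title,type",
  r"---(125030-katalog:000236595,,,Red Tampa Solist prf,"[""Tom"",""Georgia"",""1929-1930""]","Georgia Tom, 1929-1930",music)---",
  r"---(125030-katalog:000236596,,,Jane Lucas (Solist),"[""1928-1931"",""Tom,\""The"",""Georgia"",""Accompanist""]","Georgia Tom,""The Accompanist"" (1928-1931)",music)---"
)

# fixed = TRUE searches for the literal two-character sequence \" (no regex)
bad <- grep("\\\"", lines, fixed = TRUE)
print(bad)  # indices of the lines that contain \"
```

Only the third line is reported here, which matches the line number named in the fread() error message.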

My best solution so far is this:

m1 <- readLines("data.csv", encoding="UTF-8")
m2 <- gsub("\\\\\"", "\\\"", m1)
writeLines(m2, "data_new.csv", useBytes = TRUE)
m3 <- fread("data_new.csv", encoding="UTF-8", sep=",")

This seems to work.

However, I do not fully understand it, so any clarification is very welcome.
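As a possible clarification of the escaping in the gsub() call above: the R string "\\\\\"" is the regular expression \\" (one literal backslash followed by a double quote), and the replacement "\\\"" yields a bare double quote. The sketch below applies it to the single record from text.csv. It assumes R 4.0 or later for the raw-string syntax, and the variable names are illustrative.

```r
# The problematic record from text.csv, written as a raw string (R >= 4.0)
line <- r"(1,"""Oops"",\""The"",""Georgia""")"

# Regex form used in the workaround: the R string "\\\\\"" is the regex \\"
# which matches one literal backslash followed by a double quote
fixed_regex <- gsub("\\\\\"", "\\\"", line)

# The same substitution written as a literal (non-regex) match
fixed_literal <- gsub("\\\"", "\"", line, fixed = TRUE)

cat(fixed_regex, "\n")
stopifnot(identical(fixed_regex, fixed_literal))
```

After the substitution the backslash in front of the doubled quote is gone, so what remains is ordinary CSV quote doubling ("" inside a quoted field), which is presumably why fread() then parses the record.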

Best answer

Not a data.table solution, but you can try:

# read the file with 'readLines'
tmp <- readLines("trl.txt")

# create a column name vector of the first line
nms <- trimws(strsplit(tmp[1],',')[[1]])

# convert 'tmp' to a dataframe except the first line
tmp <- as.data.frame(tmp[-1])

# use 'separate' from 'tidyr' to split into two columns
library(tidyr)
df1 <- separate(tmp, "tmp[-1]", nms, sep=",", extra = "merge")

which gives:

> df1
id text
1 1 """Oops"",\\""The"",""Georgia"""

Update for Edit 1: with the new example data, fread seems to read the data without problems:

> s1 <- fread("sample1.txt", sep=",")
> s1
materiale_id dk5 description creator subject-phrase title type
1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin NA Koncert i Copenhagen Jazz House den 26.1.1995 music


> s2 <- fread("sample2.txt", sep=",")
> s2
materiale_id dk5
1: 125030-katalog:000000003 [78.793]
2: 150012-leksikon:100019
description
1: Privatoptagelse. - Liveoptagelse
2: Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...
creator subject-phrase
1: Frederik Lundin
2: [""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]
title type
1: Koncert i Copenhagen Jazz House den 26.1.1995 music
2: it - elementer i databehandling article

Update for Edits 2 and 3:

When you look at the error message:

Error in fread("sample3.txt", sep = ",") : Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.

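and then look at the second record of sample3.txt (line 3 of the file), you will see that the fifth column (subject-phrase) also contains commas. You can solve this in three steps:

1: Read the file with readLines and replace the opening and closing characters of the fifth column with a different quote character: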

r3 <- readLines("sample3.txt")
r3 <- gsub('\"[',"'",r3,fixed=TRUE)
r3 <- gsub(']\"',"'",r3,fixed=TRUE)

2: Write it back to a text file:

 writeLines(r3, "sample3-1.txt")

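3: Now you can read it with fread (or read.table/read.csv). Because the number of column headers differs from the number of columns, you have to use header = FALSE and skip the header line. Also set the quote character explicitly to the new quote character inserted in step 1: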

s3 <- fread("sample3-1.txt", quote = "\'", header = FALSE, skip = 1)

which gives:

> s3
V1 V2 V3 V4 V5 V6 V7 V8
1: 125030-katalog:000236595 NA NA Red Tampa Solist prf ""Tom"",""Georgia"",""1929-1930"" "Georgia Tom 1929-1930" music
2: 125030-katalog:000236596 NA NA Jane Lucas (Solist) ""1928-1931"",""Tom,\\""The"",""Georgia"",""Accompanist"" "Georgia Tom ""The Accompanist"" (1928-1931)" music

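After that, you can assign the column names as follows (replace the placeholder with a character vector of eight actual names):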

names(s3) <- c("character","vector","with","eight","column","names")
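The three steps and the renaming can also be sketched in memory, without the intermediate file. The sketch below uses base R's read.table(text = ...) rather than fread, so it runs without data.table, and it types the sample3.txt lines in as raw strings (R 4.0 or later assumed). The names title_part1 and title_part2 are hypothetical labels for the two pieces the title column is split into; the other names are taken from the header line with the hyphen replaced.

```r
# Lines of sample3.txt as raw strings (R >= 4.0); on a real file one would
# use readLines("sample3.txt") instead
r3 <- c(
  "materiale_id,dk5,description,creator,subject-phrase,title,type",
  r"---(125030-katalog:000236595,,,Red Tampa Solist prf,"[""Tom"",""Georgia"",""1929-1930""]","Georgia Tom, 1929-1930",music)---",
  r"---(125030-katalog:000236596,,,Jane Lucas (Solist),"[""1928-1931"",""Tom,\""The"",""Georgia"",""Accompanist""]","Georgia Tom,""The Accompanist"" (1928-1931)",music)---"
)

# Step 1: turn the "[ ... ]" delimiters of the fifth column into single quotes
r3 <- gsub('"[', "'", r3, fixed = TRUE)
r3 <- gsub(']"', "'", r3, fixed = TRUE)

# Steps 2 and 3 combined: parse the modified lines directly, skipping the
# header line and treating only the single quote as a quote character
s3 <- read.table(text = r3, sep = ",", quote = "'", header = FALSE, skip = 1,
                 stringsAsFactors = FALSE, comment.char = "")

# Eight names for eight columns; title_part1/title_part2 are hypothetical
names(s3) <- c("materiale_id", "dk5", "description", "creator",
               "subject_phrase", "title_part1", "title_part2", "type")
str(s3)
```

As in the output above, the title is split at its embedded comma because the double quote no longer acts as a quote character, which is why eight names are needed rather than seven.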

Note: I used a recent build of v1.9.7 for this (from two weeks ago).

Regarding r - using fread to read data with double quotes and incorrectly escaped characters, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35626797/
