
Reading a CSV file with embedded double quotes and commas

Reposted. Author: 行者123. Updated: 2023-12-01 22:42:41

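I am trying to read a dirty CSV file with the fread() function from the data.table package, but I am running into trouble with double quotes and commas embedded in string values, i.e. unescaped double quotes inside quoted fields. The sample data below illustrates the problem. It consists of 3 rows and 6 columns, with the first row holding the column names: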

"SA","SU","CC","CN","POC","PAC"
"NE","R","000","H "B", O","1","8"
"A","A","000","P","E,5","8"

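The first problem is on line 2, where a field embeds a pair of double quotes and a comma: "H "B", O". The second problem is on line 3, where a comma sits inside a quoted field: "E,5". I have tried the following: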

Attempt 1

library(data.table)
x1 <- fread(file = "example.csv", quote = "\"")

Output:

> x1
     V1 "SA" "SU"   "CC" "CN" "POC" "PAC"
1: "NE"  "R"    0 "H "B"   O"   "1"     8
2:  "A"  "A"    0    "P"   "E    5"     8

Message:

Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
Detected 6 column names but the data has 7 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.

Conclusion: the result is incorrect, because fread split the data into 7 columns and added a new column V1.

Attempt 2

x2 <- fread(file = "example.csv", quote = "")

Output:

> x2
     V1 "SA"  "SU"   "CC" "CN" "POC" "PAC"
1: "NE"  "R" "000" "H "B"   O"   "1"   "8"
2:  "A"  "A" "000"    "P"   "E    5"   "8"

Message:

Detected 6 column names but the data has 7 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.

Conclusion: the result is incorrect again, because a new column V1 was added.
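The 7-column count in both messages can be reproduced without fread. The following is a minimal base R sketch, not part of the original question: it stores the three sample lines as strings (the variable names `lines`, `n_naive` and `n_quoted` are made up for illustration) and compares splitting on every comma with splitting only on the three-character delimiter `","` that separates two quoted fields.

```r
# The three sample lines from the question, stored as R strings.
lines <- c(
  '"SA","SU","CC","CN","POC","PAC"',
  '"NE","R","000","H "B", O","1","8"',
  '"A","A","000","P","E,5","8"'
)

# Split on every comma, treating quotes as ordinary characters.
n_naive <- lengths(strsplit(lines, ",", fixed = TRUE))

# Split only on the full delimiter `","` between two quoted fields.
n_quoted <- lengths(strsplit(lines, '","', fixed = TRUE))

print(n_naive)   # [1] 6 7 7
print(n_quoted)  # [1] 6 6 6
```

The header yields 6 fields either way, while both data lines yield 7 under the plain comma split. This matches the "6 column names but 7 columns" message and suggests that any fix has to treat `","` as a whole, not the bare comma, as the separator.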

Solution?

I am looking for a way to get output similar to this:

> x3
   SA SU CC       CN POC PAC
1: NE  R  0 H 'B', O   1   8
2:  A  A  0        P E,5   8

Preferably with fread(), but other suggestions are welcome.

Best answer

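You could try cleaning the data beforehand: replace the double quotes that delimit each field with single quotes, then tell fread() to use the single quote as the quote character.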

library(data.table)

x <- readLines('my_file.csv')
y <- gsub('","', "','", x)          # swap the "," delimiter between fields for ','
y <- gsub('^"|"$', "'", y)          # swap each line's leading and trailing double quote
df <- fread(text = y, quote = "'")  # hand the cleaned lines to fread as text
df

   SA SU CC       CN POC PAC
1: NE  R  0 H "B", O   1   8
2:  A  A  0        P E,5   8

I cannot confirm whether this will work for you, because I do not know how large your file is, but it may be an approach worth trying.
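Since the question welcomes other suggestions, here is a hedged base R sketch of the same idea without fread. It is not from the original answer. It assumes every field in the file is wrapped in double quotes, that no field contains the literal sequence `","`, and that no line ends in an empty field. The sample lines are inlined so the snippet runs on its own; with a real file they would come from `readLines()`.

```r
# Sample lines from the question; in practice: lines <- readLines("example.csv")
lines <- c(
  '"SA","SU","CC","CN","POC","PAC"',
  '"NE","R","000","H "B", O","1","8"',
  '"A","A","000","P","E,5","8"'
)

# Drop the outer quotes of each line, then split on the `","` delimiter.
stripped <- gsub('^"|"$', "", lines)
fields <- strsplit(stripped, '","', fixed = TRUE)

# Stop early if any line yields a different number of fields than the header.
stopifnot(all(lengths(fields) == length(fields[[1]])))

# The first line holds the column names; the remaining lines are data.
df <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
names(df) <- fields[[1]]
print(df)
```

All columns come back as character here, so CC stays "000" and is not converted to 0. If numeric columns are wanted, `utils::type.convert(df, as.is = TRUE)` could be applied afterwards.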

This question and answer, on reading a CSV file with embedded double quotes and commas, come from a similar question on Stack Overflow: https://stackoverflow.com/questions/52957453/
