
C++ CSV parsing with commas inside quotes

Reposted. Author: 太空狗. Updated: 2023-10-29 19:58:48

I am building a C++ CSV data parser. I am trying to access the first and fifteenth columns of the file and read them into two arrays using getline. For example:

for(int j=0;j<i;j++)
{
    // First column: the post ID.
    getline(posts2,postIDs[j],',');
    // Skip columns 2 through 15.
    for(int k=0;k<14;k++)
    {
        getline(posts2,tossout,',');
    }
    getline(posts2,answerIDs[j],',');
    // Discard the rest of the line.
    getline(posts2,tossout,'\r');
}

However, between the first and fifteenth columns there is a quoted column that contains assorted commas and loose quotes. For example:

...,"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",... <

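What is the best way to skip this column? I cannot getline past it, because it has quotes and commas inside it. After running into a quote, should I read the quoted junk character-by-character until I find ", in sequence?

Also, I have seen other solutions, but they are all Windows/Visual Studio specific. I am running Mac OS X version 10.8.3 with Xcode 3.2.3.

Thanks in advance! Drew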

Best answer

There is no formal standard for the CSV format, but let us note at the outset that the ugly column you cite:

"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",

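does not conform to what are regarded as the Basic Rules of CSV, since two of them are:-

  • 1) Fields with embedded commas must be quoted.

  • 2) Each embedded double-quote character must be represented by a pair of double-quote characters.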
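
To make the two rules concrete, here is a minimal sketch of how a writer might encode a field so that it obeys both of them. The function name quote_csv_field is hypothetical and is not part of the original answer; treating line breaks as another reason to quote is likewise an assumption.

```cpp
#include <string>

// Hypothetical helper (not from the original answer): encodes one field
// so that it obeys Rule 1) and Rule 2).
std::string quote_csv_field(const std::string& field)
{
    // Assumption: fields without commas, quotes or line breaks are
    // written unquoted and unchanged.
    if (field.find_first_of(",\"\r\n") == std::string::npos) {
        return field;
    }
    std::string out = "\"";                 // Rule 1: enclose in quotes
    for (std::string::size_type i = 0; i < field.size(); ++i) {
        if (field[i] == '"') {
            out += '"';                     // Rule 2: double each embedded quote
        }
        out += field[i];
    }
    out += '"';
    return out;
}
```

Under this sketch, the text Super, "luxurious", truck would be written as "Super, ""luxurious"", truck", which is the well-formed shape used in the sample data later in the answer.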

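If the problem column obeys Rule 1), then it does not obey Rule 2). But we can construe it so that it obeys Rule 1), and so that we can say where it ends, if we balance the double quotes, e.g.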

[abc, defghijk. [Lmnopqrs, ]tuv,[] wxyz.],

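The balanced outermost quotes enclose the column. The balanced inner quotes simply have no other mark of being internal, apart from the balancing that makes them internal.

We would like a rule that parses this text as one column, consistently with Rule 1), and that also parses columns that do obey Rule 2). The balancing just shown suggests that this can be done, because a column that obeys both rules is necessarily balanceable too.

The suggested rule is:

  • A column runs up to the first comma that is either preceded by 0 double quotes or is immediately preceded by the last of an even number of double quotes.

If there is an even number of double quotes up to the comma, then we know we can balance the enclosing quotes and the rest in at least one way.

The simpler rule that you are considering: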

After running into a quote, should I read the quoted junk character-by-character until I find ", in sequence?

will fail if it encounters certain columns that do obey Rule 2), e.g.

"Super, ""luxurious"", truck",

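The simpler rule will terminate the column after ""luxurious"". But since this column conforms to Rule 2), the adjacent double quotes are "escaped" double quotes with no delimiting significance. The suggested rule, on the other hand, still parses the column correctly, terminating it after truck".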
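
The difference between the two rules can be sketched on an in-memory string. Both function names below are hypothetical and chosen only for illustration; each returns the index one past the comma that its rule treats as the end of the column, or std::string::npos if no such comma is found.

```cpp
#include <string>

// Simpler rule (illustrative name): a quoted column ends at the first
// `",` sequence found after the opening quote.
std::string::size_type end_by_simple_rule(const std::string& text)
{
    const std::string::size_type pos = text.find("\",", 1);
    return pos == std::string::npos ? pos : pos + 2;
}

// Suggested rule (illustrative name): a column ends at the first comma
// that is preceded by 0 quotes, or immediately preceded by the last of
// an even number of quotes.
std::string::size_type end_by_suggested_rule(const std::string& text)
{
    unsigned quotes = 0;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        if (text[i] == '"') {
            ++quotes;
        } else if (text[i] == ',') {
            const bool after_quote = i > 0 && text[i - 1] == '"';
            if (quotes == 0 || (after_quote && quotes % 2 == 0)) {
                return i + 1;
            }
        }
    }
    return std::string::npos;
}
```

For the column "Super, ""luxurious"", truck", the simpler rule stops right after ""luxurious"", because that is the first place where a quote is followed by a comma, while the suggested rule sees an odd quote count there (5) and carries on to the comma after truck", where the count is 6.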

Here is a demo program in which the function get_csv_column parses a column according to the suggested rule:

#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>

using namespace std;

/*
    Assume `in` is positioned at start of column.
    Accumulates chars from `in` as long as `in` is good
    until either:-
        - Have consumed a comma preceded by 0 quotes, or
        - Have consumed a comma immediately preceded by
          the last of an even number of quotes.
*/
std::string get_csv_column(ifstream & in)
{
    std::string col;
    unsigned quotes = 0;
    char prev = 0;
    bool finis = false;
    for (int ch; !finis && (ch = in.get()) != EOF; ) {
        switch(ch) {
        case '"':
            ++quotes;
            break;
        case ',':
            if (quotes == 0 || (prev == '"' && (quotes & 1) == 0)) {
                finis = true;
            }
            break;
        default:;
        }
        // Keep every char, including the terminating comma.
        prev = static_cast<char>(ch);
        col += prev;
    }
    return col;
}

int main()
{
    ifstream in("csv.txt");
    if (!in) {
        cout << "Open error :(" << endl;
        exit(EXIT_FAILURE);
    }
    for (std::string col; in; ) {
        col = get_csv_column(in);
        cout << "<[" << col << "]>" << std::endl;
    }
    if (!in && !in.eof()) {
        cout << "Read error :(" << endl;
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}

It prints each column enclosed in <[...]>, leaving newlines in place and including each column's terminating ',':

The file csv.txt is:

...,"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",...,
",","",
Year,Make,Model,Description,Price,
1997,Ford,E350,"Super, ""luxurious"", truck",
1997,Ford,E350,"Super, ""luxurious"" truck",
1997,Ford,E350,"ac, abs, moon",3000.00,
1999,Chevy,"Venture ""Extended Edition""","",4900.00,
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00,
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00

The output is:

<[...,]>
<["abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",]>
<[...,]>
<[
",",]>
<["",]>
<[
Year,]>
<[Make,]>
<[Model,]>
<[Description,]>
<[Price,]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["Super, ""luxurious"", truck",]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["Super, ""luxurious"" truck",]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["ac, abs, moon",]>
<[3000.00,]>
<[
1999,]>
<[Chevy,]>
<["Venture ""Extended Edition""",]>
<["",]>
<[4900.00,]>
<[
1999,]>
<[Chevy,]>
<["Venture ""Extended Edition, Very Large""",]>
<[,]>
<[5000.00,]>
<[
1996,]>
<[Jeep,]>
<[Grand Cherokee,]>
<["MUST SELL!
air, moon roof, loaded",]>
<[4799.00]>
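
To connect this back to the question (picking out the first and fifteenth columns), the same rule could be applied to a whole record held in a string. This is only a sketch under stated assumptions: the name split_csv_record is hypothetical, the record is assumed to be already in memory as one string, and, unlike get_csv_column, the terminating commas are dropped.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): splits one CSV
// record into columns using the suggested rule. Enclosing quotes are
// kept; the terminating comma of each column is dropped.
std::vector<std::string> split_csv_record(const std::string& record)
{
    std::vector<std::string> columns;
    std::string col;
    unsigned quotes = 0;   // quotes seen so far in the current column
    char prev = 0;         // previous char of the current column
    for (std::string::size_type i = 0; i < record.size(); ++i) {
        const char ch = record[i];
        if (ch == '"') {
            ++quotes;
        }
        if (ch == ',' && (quotes == 0 || (prev == '"' && quotes % 2 == 0))) {
            columns.push_back(col);   // column complete
            col.clear();
            quotes = 0;
            prev = 0;
        } else {
            col += ch;
            prev = ch;
        }
    }
    columns.push_back(col);           // last column has no terminating comma
    return columns;
}
```

With a record of at least fifteen columns, columns[0] and columns[14] would then be the post ID and answer ID from the question; the caller should check columns.size() before indexing. A record that ends in a comma yields one extra empty column at the end.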

This question about C++ CSV parsing with commas inside quotes comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/17738992/
