
python - Reading a CSV file whose quoted fields contain doubled double quotes and line breaks

Reposted. Author: 行者123. Updated: 2023-12-01 06:33:50

I have a large file that I want to read with Python. It looks like this:

"2019-10-09 10:11:09","NICK","Hello, how are you
today? I'm like ""weather"", often changing."

I want to read this file into a data frame like the following:

col1                  col2          col3
2019-10-09 09:32:09 NICK Hello, how are you today? I'm like ""weather"", often changing.

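I have run into several problems. First, my delimiter is "," and it also appears inside some of the messages in col3. Second, some col3 messages contain line breaks, which I do not know how to handle (as in the example, after "you"). Finally, col3 messages also contain doubled double quotes (as in ""weather""), which stand for quotation marks within the message.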
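As a point of reference, all three features listed above (commas, line breaks and doubled quotes inside a quoted field) are part of the usual CSV quoting convention, so a well-formed record like the sample should parse without special options. The sketch below checks this with the standard-library csv module rather than pandas. The names SAMPLE and parse_rows are made up for illustration.

```python
import csv
import io

# Sample record copied from the question: a comma, a line break and doubled
# quotes all sit inside the quoted third field.
SAMPLE = (
    '"2019-10-09 10:11:09","NICK","Hello, how are you\n'
    'today? I\'m like ""weather"", often changing."\n'
)


def parse_rows(text):
    """Parse CSV text with the csv module's default dialect (hypothetical helper)."""
    # newline='' leaves line breaks inside quoted fields untouched.
    return list(csv.reader(io.StringIO(text, newline='')))


rows = parse_rows(SAMPLE)
print(len(rows), len(rows[0]))
print(rows[0][2])
```

If this parses cleanly but the real file does not, the trouble is more likely a record that breaks the quoting convention than any of the three features themselves.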

I tried to read the file like this:

import pandas as pd

with open('/data/myfile.csv', 'r', encoding='utf-8') as csvfile:
    df = pd.read_csv(csvfile, sep=",", quotechar='"', escapechar='\\')

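Unfortunately, this does not work, and I do not know which of the three things I described causes the problem. The error says that it expected three columns but found a different number.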

Edit: There must be some other problem, because it still shows this error:

Error tokenizing data. C error: Expected 3 fields in line 60, saw 5

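When I look at the file, I cannot tell how the parser counts lines, because some col3 messages contain line breaks. How can I print the exact line that causes the problem?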
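One way to locate the offending record is to let the standard-library csv.reader walk the file and report the physical line range of every record that looks wrong. This is a sketch, not the pandas tokenizer itself, so its line numbers may not match the pandas error exactly. It assumes that every real record has three fields and starts with a "YYYY-MM-DD HH:MM:SS" timestamp. The names TIMESTAMP and find_suspect_records are hypothetical.

```python
import csv
import io
import re

# Assumption: the first field of every real record is a timestamp like this.
TIMESTAMP = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')


def find_suspect_records(lines, expected_fields=3):
    """Return (first_line, last_line, fields) for records that look wrong."""
    reader = csv.reader(lines)
    suspects = []
    previous_end = 0
    for fields in reader:
        # reader.line_num counts physical lines consumed so far, so a record
        # spans the lines from previous_end + 1 up to reader.line_num.
        first_line, last_line = previous_end + 1, reader.line_num
        previous_end = reader.line_num
        if not fields:
            continue  # completely blank line between records
        if len(fields) != expected_fields or not TIMESTAMP.fullmatch(fields[0]):
            suspects.append((first_line, last_line, fields))
    return suspects


demo = io.StringIO(
    '"2019-10-09 10:11:09","NICK","fine, even with ""quotes""\nand a break"\n'
    '"2019-10-09 10:11:09","NICK","text "bad, quote" more\nAwww. and, there"\n',
    newline='',
)
for first, last, fields in find_suspect_records(demo):
    print(f"lines {first}-{last}: {len(fields)} fields: {fields}")
```

For a real file, the same function can be given a file object opened with open(path, newline='', encoding='utf-8'). The reported line numbers can then be passed to sed -n to inspect the raw text.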

Edit 2: I ran this command in the terminal:

sed -n 60p myfile.csv

It printed an empty line, so I also printed a few lines before and after it. They look like this:

"2019-10-09 10:11:09","som1","This isn't this.
It's like this, and this.

And as my opinions is this.

Finally, it's the end."

Edit 3: @Boendal is right. The lines I included do not cause the problem. I have now changed the code to:

with open('opinions-ml.csv', 'r', encoding='utf-8') as csvfile:
    df = pd.read_csv(csvfile, names=['col1', 'col2', 'col3'], sep=",", quotechar='"', escapechar='\\')

I found that the problem is caused by lines like this:

"2019-10-09 10:11:09","NICK","This is some text "and this, is quote" and it is also text
Awww. and, there was, line break"

Python reads this into the data frame as:

col1                  col2          col3
2019-10-09 09:32:09 NICK This is some text and this
Awww. and there was line break

Do you think there is a chance of solving this? Perhaps with a regular expression? Or should I go back to the file provider to have it fixed?

Edit 4: Another line:

"2019-10-09 10:11:09","NICK","This is some text "and this is quote" and it is also text
Awww. and there, was line break"

Python reads this into the data frame as:

col1                  col2            col3
2019-10-09 09:32:09 NICK This is some text and this is quote" and it is also text
Awww. and there was line break NaN
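The mis-split described in Edits 3 and 4 can be reproduced without pandas. In the sketch below, the standard-library csv.reader (which may differ in detail from the pandas tokenizer) treats the undoubled quote before "and" as the end of the quoted field. Every later comma and the line break are then taken as real separators. The names BAD and rows are illustrative only.

```python
import csv
import io

# Malformed record from Edit 3: the inner quotes are not doubled.
BAD = (
    '"2019-10-09 10:11:09","NICK","This is some text "and this, is quote" and it is also text\n'
    'Awww. and, there was, line break"\n'
)

rows = list(csv.reader(io.StringIO(BAD, newline='')))
for row in rows:
    # Each physical line comes back as its own record, with the wrong field count.
    print(len(row), row)
```

Because the quote characters themselves are ambiguous here, no combination of quotechar and escapechar settings can be expected to recover such a record reliably. The file has to be repaired, either by preprocessing or by the file provider.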

Best answer

As far as I can tell, a csv dialect may help. The following code produces the correct output.

import pandas as pd
import csv

csv.register_dialect('mydialect', delimiter=',', quoting=csv.QUOTE_ALL, doublequote=True)
df = pd.read_csv('test.csv', dialect='mydialect')
df
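A related option, not part of the answer above, is to add strict=True to a dialect and use it with the standard-library csv.reader. In strict mode the reader raises csv.Error on malformed quoting instead of silently mis-splitting the row. The dialect name mydialect_strict and the helper try_parse are hypothetical. The effect shown is that of csv.reader, and pandas may not honor the strict setting.

```python
import csv
import io

# Hypothetical dialect name; strict=True is an addition, not in the answer above.
csv.register_dialect('mydialect_strict', delimiter=',', quoting=csv.QUOTE_ALL,
                     doublequote=True, strict=True)

GOOD = '"2019-10-09 10:11:09","NICK","I\'m like ""weather"", often\nchanging."\n'
BAD = '"2019-10-09 10:11:09","NICK","some text "and this, is quote" more\nAwww."\n'


def try_parse(text):
    """Return the parsed rows, or the csv.Error message if the text is malformed."""
    try:
        reader = csv.reader(io.StringIO(text, newline=''), dialect='mydialect_strict')
        return list(reader)
    except csv.Error as exc:
        return f"csv.Error: {exc}"


print(try_parse(GOOD))
print(try_parse(BAD))
```

This does not repair anything, but it turns quietly corrupted rows into an explicit failure, which can be safer when loading a large file.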

Solution 2: Reformat the data

  • The first 2 columns need no processing.
  • The third column needs escaping.
  • Split the line on , (comma) and escape the value at the third index.

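    import csv

    with open('test.csv') as infile, open('reformated_data.csv', 'w', newline='') as outfile:
        # QUOTE_NONE writes no quotes; delimiters and quote characters are
        # escaped with a backslash instead.
        outputWriter = csv.writer(outfile, delimiter=',',
                                  escapechar='\\', quoting=csv.QUOTE_NONE)
        for line in infile:
            line = line.split(',')
            col12 = line[0:2]  # first two columns, left as they are
            # Everything after the second comma is re-joined (the commas removed by
            # split() are not restored); line breaks become escape sequences such as \n.
            col3 = ''.join(line[2:]).encode("unicode_escape").decode("utf-8")
            outputWriter.writerow(col12 + [col3])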
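The reformatting idea can also be applied per record rather than per physical line, using a regular expression as the question suggests. This is a different technique from the loop above and only a sketch under loudly stated assumptions: every record starts on a line beginning with a quoted "YYYY-MM-DD HH:MM:SS" timestamp, the first two fields never contain a double quote, and no continuation line of a message starts with such a timestamp. The names RECORD_START, RECORD, repair_records and to_valid_csv are hypothetical.

```python
import csv
import io
import re

# Assumption: a record starts at a line that begins with "YYYY-MM-DD HH:MM:SS",
RECORD_START = re.compile(
    r'^(?="\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}",)', re.MULTILINE)
# Assumption: the first two fields never contain a double quote; the third
# field is everything up to the last quote of the record.
RECORD = re.compile(r'"([^"]*)","([^"]*)","(.*)"\s*\Z', re.DOTALL)


def repair_records(text):
    """Split raw text into records and return a list of [col1, col2, col3]."""
    rows = []
    for chunk in RECORD_START.split(text):
        if not chunk.strip():
            continue  # text before the first record, or blank padding
        match = RECORD.match(chunk)
        if match is None:
            raise ValueError(f"unrecognised record: {chunk!r}")
        col1, col2, col3 = match.groups()
        # Doubled quotes become single quotes; stray single quotes are kept as-is.
        rows.append([col1, col2, col3.replace('""', '"')])
    return rows


def to_valid_csv(rows):
    """Write the rows back out with standard quoting so any CSV reader can load them."""
    out = io.StringIO(newline='')
    csv.writer(out, quoting=csv.QUOTE_ALL).writerows(rows)
    return out.getvalue()


raw = (
    '"2019-10-09 10:11:09","NICK","This is some text "and this, is quote" and it is also text\n'
    'Awww. and, there was, line break"\n'
    '"2019-10-09 10:11:09","NICK","Hello, how are you\n'
    'today? I\'m like ""weather"", often changing."\n'
)
fixed = repair_records(raw)
print(to_valid_csv(fixed))
```

The repaired text can then be written to a file and loaded with pd.read_csv using names=['col1', 'col2', 'col3']. If the assumptions do not hold for the real data, asking the file provider for correctly quoted output remains the more robust fix.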

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
https://docs.python.org/3/library/csv.html#dialects-and-formatting-parameters

Regarding python - reading a CSV file whose quoted fields contain doubled double quotes and line breaks, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59765755/
