
scala - Spark CSV package cannot handle \n within fields

Reposted. Author: 行者123. Updated: 2023-12-04 19:56:28

I have a CSV file that I am trying to load with the Spark CSV package, but the data does not load correctly because a few fields contain \n, for example the following two rows:

"XYZ", "Test Data", "TestNew\nline", "OtherData" 
"XYZ", "Test Data", "blablablabla
\nblablablablablalbal", "OtherData"

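I am using the following code. It is straightforward; I set parserLib to univocity because I read online that it solves the multi-line problem, but that does not seem to be the case for me.

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Spark 1.x with the external spark-csv package; sc is an existing SparkContext.
SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("parserLib", "univocity")
    .load("data.csv");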

How can I replace the newlines inside fields that start with a quote? Is there an easier way?

Best answer

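According to SPARK-14194 (resolved as a duplicate), fields containing newline characters were not supported at the time, but the resolution comment quoted below shows that support was proposed through a wholeFile option: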

I proposed to solve this via wholeFile option and it seems merged. I am resolving this as a duplicate of that as that one has a PR.


That applies to Spark 2.0, however, and you are using the spark-csv module.
In the referenced SPARK-19610 it was fixed with a pull request:

hmm, I understand the motivation for this, though my understanding with csv generally either avoid having newline in field or some implementation would require quotes around field value with newline


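In other words, use the wholeFile option in Spark 2.x (as seen in CSVDataSource). Note that in released Spark versions (2.2 and later) this option is named multiLine.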
As for spark-csv, this comment might be of help (highlighting mine):

However, that there are a quite bit of similar JIRAs complaining about this and the original CSV datasource tried to support this although that was incorrectly implemented. This tries to match it with JSON one at least and it might be better to provide a way to process such CSV files. Actually, current implementation requires quotes :). (It was told R supports this case too actually).
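
Since the multi-line fields in the question are quoted, one possible workaround for spark-csv is to flatten the embedded line breaks before loading the file. The following is a minimal sketch of that idea and is not part of the original answer. The class name QuotedNewlineFlattener and the method flatten are hypothetical. It assumes that quotes inside a field are escaped by doubling them ("") rather than with a backslash, and that the file is small enough to hold in memory as a String.

```java
final class QuotedNewlineFlattener {

    private QuotedNewlineFlattener() {}

    /**
     * Replaces line breaks that occur inside double-quoted fields with
     * the given replacement. Line breaks outside quotes (the record
     * terminators) are copied through unchanged.
     */
    static String flatten(String csv, String replacement) {
        StringBuilder out = new StringBuilder(csv.length());
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // Every quote toggles the state. A doubled quote ("")
                // toggles twice, so the state is unchanged afterwards.
                inQuotes = !inQuotes;
                out.append(c);
            } else if (inQuotes && (c == '\n' || c == '\r')) {
                // Treat \r\n as a single line break.
                if (c == '\r' && i + 1 < csv.length() && csv.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The flattened text could then be written to a new file and loaded with the same spark-csv code as in the question. The trade-off is that the original line breaks are lost, so choose a replacement (a space, or a marker string that can be converted back later) that suits the data.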


In spark-csv's Features you can find the following:

The package also supports saving simple (non-nested) DataFrame. When writing files the API accepts several options:

  • quote: by default the quote character is ", but can be set to any character. This is written according to quoteMode.

  • quoteMode: when to quote fields (ALL, MINIMAL (default), NON_NUMERIC, NONE), see Quote Modes

This post is based on a similar question found on Stack Overflow, "scala - Spark CSV package cannot handle \n within fields": https://stackoverflow.com/questions/44268262/
