
json - SPARK read.json throws java.io.IOException: Too many bytes before newline

Reposted · Author: 行者123 · Updated: 2023-12-01 17:45:47

I'm getting the following error when reading a large 6 GB single-line JSON file:

Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648

Spark won't read JSON files that contain newlines, so the entire 6 GB JSON file is on a single line:

jf = sqlContext.read.json("jlrn2.json")

Configuration:

spark.driver.memory 20g

Best Answer

Yes, your line contains more than Integer.MAX_VALUE bytes. You need to split it up.

Keep in mind that Spark expects each line to be a valid JSON document, not the file as a whole. Below is the relevant passage from the Spark SQL Programming Guide:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

So, if your JSON document is of the form...

[
{ [record] },
{ [record] }
]

you'll need to change it to

{ [record] }
{ [record] }
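The reshaping above can be done with a short script. Here is a minimal Python sketch of that conversion (the function name and file paths are illustrative, not from the original answer). Note that `json.load` reads the whole file into memory, so for a genuinely 6 GB input you would want an incremental parser such as `ijson` instead; the structure of the loop would stay the same.

```python
import json

def array_to_json_lines(src_path, dst_path):
    """Rewrite a JSON file containing one big array of records as
    newline-delimited JSON: one self-contained object per line,
    which is the layout Spark's read.json expects."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        # json.load parses the entire array into memory; fine for a
        # demo, but swap in a streaming parser for multi-GB files.
        for record in json.load(src):
            dst.write(json.dumps(record) + "\n")
```

After the conversion, the original read should work unchanged, e.g. `jf = sqlContext.read.json("jlrn2_lines.json")` (output filename hypothetical).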

Regarding "json - SPARK read.json throws java.io.IOException: Too many bytes before newline", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35990846/
