
java - Reading a JSON file with Apache Spark

Reposted. Author: 可可西里. Updated: 2023-11-01 14:18:20

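I am trying to read a JSON file with Spark v2.0.0. It works well when the data is simple. When the data is a bit more complex, however, the output of list.show() is not displayed correctly.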

Here is my code:

SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.show();

Here is my sample data:

{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

My output looks like this:

+--------------------+
|     _corrupt_record|
+--------------------+
|                   {|
|       "glossary": {|
|        "title": ...|
|       "GlossDiv": {|
|            "titl...|
|      "GlossList": {|
|                "...|
|                 ...|
|     "SortAs": "S...|
|     "GlossTerm":...|
|     "Acronym": "...|
|     "Abbrev": "I...|
|       "GlossDef": {|
|                 ...|
|      "GlossSeeAl...|
|                 ...|
|     "GlossSee": ...|
|                   }|
|                   }|
|                   }|
+--------------------+
only showing top 20 rows

Best answer

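If you have to read this JSON, you need to format it as a single line. The file is multi-line JSON, so it is not read and loaded correctly: the reader expects one object per line (one object, one row). Each line is parsed as a separate record, so every fragment fails to parse and ends up in the _corrupt_record column.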

Quoting the JSON API documentation:

Loads a JSON file (one object per line) and returns the result as a DataFrame.

{"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}}}}}
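If the file is only available in pretty-printed form, one way to get the single-line version is to collapse it before handing it to Spark. The following is a minimal sketch using only the Java standard library. The class name JsonOneLine and the method toSingleLine are made up for illustration, and it assumes the file holds exactly one valid JSON document. It relies on the fact that JSON string literals cannot contain raw line breaks, so line breaks and the indentation around them are insignificant whitespace.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Spark: collapses one pretty-printed
// JSON document into a single line so it matches "one object per line".
public class JsonOneLine {

    // Trims each line and joins the pieces without a separator.
    // Assumes the input is a single valid JSON document; whitespace
    // inside string values is untouched because a JSON string never
    // spans more than one line.
    public static String toSingleLine(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                .collect(Collectors.joining());
    }

    // Usage: java JsonOneLine <input.json> <output.json>
    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        String oneLine = toSingleLine(Files.readAllLines(in, StandardCharsets.UTF_8));
        Files.write(out, (oneLine + "\n").getBytes(StandardCharsets.UTF_8));
    }
}
```

Pointing session.read().json(...) at the output file should then give one row with a nested glossary struct. A file containing several pretty-printed objects would need a real JSON parser to split them, which this sketch does not attempt.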

I just tried this in the shell, and it should work the same way from code. (When I read the multi-line JSON, I got the same corrupt-record error.)

scala> val df = spark.read.json("C:/DevelopmentTools/data.json")
df: org.apache.spark.sql.DataFrame = [glossary: struct<GlossDiv: struct<GlossList: struct<GlossEntry: struct<Abbrev: string, Acronym: string ... 5 more fields>>, title: string>, title: string>]

scala>

Edit:

You can use any operation to get values out of that DataFrame, for example:

scala> df.select(df("glossary.GlossDiv.GlossList.GlossEntry.GlossTerm")).show()
+--------------------+
|           GlossTerm|
+--------------------+
|Standard Generali...|
+--------------------+


scala>

You should be able to do the same from your code.
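For the Java side, a rough equivalent of the shell session might look like the sketch below. It assumes Spark SQL is on the classpath and that the file already holds one JSON object per line. The class name GlossTermReader and the method readGlossTerm are made up for illustration; the Spark calls themselves (read().json, select, functions.col, first) are the standard Dataset API.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

// Hypothetical wrapper class; requires the spark-sql dependency.
public class GlossTermReader {

    // Reads a file with one JSON object per line and returns the nested
    // GlossTerm value of the first row. Nested struct fields are addressed
    // with a dotted path, as in the Scala example.
    public static String readGlossTerm(SparkSession session, String path) {
        Dataset<Row> df = session.read().json(path);
        Dataset<Row> term =
                df.select(col("glossary.GlossDiv.GlossList.GlossEntry.GlossTerm"));
        term.show();
        return term.first().getString(0);
    }

    // Usage: pass the path of the single-line JSON file as the first argument.
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .master("local")
                .appName("jsonreader")
                .getOrCreate();
        System.out.println(readGlossTerm(session, args[0]));
        session.stop();
    }
}
```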

For more on "java - Reading a JSON file with Apache Spark", see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/40212464/
