
json - Apache Spark read JSON with extra columns


I am reading a Hive table that has two columns: id and jsonString. I can easily turn jsonString into a Spark data structure by calling spark.read.json, but I have to add the column id as well.

val jsonStr1 = """{"fruits":[{"fruit":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}"""
val jsonStr2 = """{"fruits":[{"dt":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}"""
val jsonStr3 = """{"fruits":[{"a":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}"""


case class Foo(id: Integer, json: String)

// toDS and the $ column syntax require `import spark.implicits._`
val ds = Seq(Foo(1, jsonStr1), Foo(2, jsonStr2), Foo(3, jsonStr3)).toDS

// Infer the JSON schema by reading the json strings as a separate DataFrame
val jsonDF = spark.read.json(ds.select($"json").rdd.map(r => r.getAs[String](0)).toDS)

jsonDF.show()
+--------------------+------------------+------------------+--------------------+
| bar| cars| daniel| fruits|
+--------------------+------------------+------------------+--------------------+
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[,,, banana], [,...|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[, banana,,], [,...|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[banana,,,], [,,...|
+--------------------+------------------+------------------+--------------------+

I would like to add the column id from the Hive table, like this:

+--------------------+------------------+------------------+--------------------+---+
| bar| cars| daniel| fruits| id|
+--------------------+------------------+------------------+--------------------+---+
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[,,, banana], [,...| 1|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[, banana,,], [,...| 2|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[banana,,,], [,,...| 3|
+--------------------+------------------+------------------+--------------------+---+

I will not use regular expressions.

I created a UDF that takes both fields as parameters, adds the desired field (id) using a proper JSON library, and returns a new JSON string. It works like a charm, but I would expect the Spark API to offer a better way to do this. I am using Apache Spark 2.3.0.
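
For reference, a minimal sketch of what such a UDF could look like, assuming json4s (which ships with Spark) as the JSON library; the names addId and jsonWithIdDF are only illustrative:

import org.apache.spark.sql.functions.udf
import org.json4s._
import org.json4s.jackson.JsonMethods.{parse, compact, render}

// Hypothetical UDF: parse the JSON string, merge in an "id" field and
// serialize it back, so that spark.read.json later picks the id up as well.
val addId = udf { (id: Int, json: String) =>
  val withId = parse(json) merge JObject("id" -> JInt(BigInt(id)))
  compact(render(withId))
}

val jsonWithIdDF = spark.read.json(ds.select(addId($"id", $"json").as[String]))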

Best answer

I already knew about the from_json function, but in my case manually building the schema for each JSON is "impossible". I assumed Spark would offer a more "idiomatic" interface.

Here is my final solution:

ds.select($"id", from_json($"json", jsonDF.schema).alias("_json_path")).select($"_json_path.*", $"id").show

ds.select($"id", from_json($"json", jsonDF.schema).alias("_json_path")).select($"_json_path.*", $"id").show

+--------------------+------------------+------------------+--------------------+---+
| bar| cars| daniel| fruits| id|
+--------------------+------------------+------------------+--------------------+---+
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[,,, banana], [,...| 1|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[, banana,,], [,...| 2|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[banana,,,], [,,...| 3|
+--------------------+------------------+------------------+--------------------+---+
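
For completeness, a sketch of the full flow described above with the imports spelled out. The schema is inferred once from the json column (the Dataset[String] overload of spark.read.json, available since Spark 2.2, avoids the RDD round trip) and then reused by from_json; the names inferredSchema and result are only illustrative:

import org.apache.spark.sql.functions.from_json

// Infer the schema once from all json strings, then reuse it with from_json
val inferredSchema = spark.read.json(ds.select($"json").as[String]).schema

val result = ds
  .select($"id", from_json($"json", inferredSchema).alias("_json_path"))
  .select($"_json_path.*", $"id")

result.show()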

Regarding json - Apache Spark read JSON with extra columns, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55074331/
