
scala - Nested JSON in Spark

Reposted. Author: 行者123. Updated: 2023-12-04 05:34:23

I have loaded some JSON as a DataFrame with the following schema:

root
|-- data: struct (nullable = true)
| |-- field1: string (nullable = true)
| |-- field2: string (nullable = true)
|-- moreData: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- more1: string (nullable = true)
| | |-- more2: string (nullable = true)
| | |-- more3: string (nullable = true)

I would like to obtain the following RDD from this DataFrame:

RDD[(more1, more2, more3, field1, field2)]

How can I do this? I assume I have to use flatMap to handle the nested JSON?

Best answer

A combination of explode and dot syntax should do the trick:

import org.apache.spark.sql.functions.explode

case class Data(field1: String, field2: String)
case class MoreData(more1: String, more2: String, more3: String)

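// In spark-shell the implicits needed for toDF and $"..." are already in scope.
// In a standalone application, import them from your session or SQL context first.
val df = sc.parallelize(Seq(
  (Data("foo", "bar"), Array(MoreData("a", "b", "c"), MoreData("d", "e", "f")))
)).toDF("data", "moreData")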

df.printSchema
// root
// |-- data: struct (nullable = true)
// | |-- field1: string (nullable = true)
// | |-- field2: string (nullable = true)
// |-- moreData: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- more1: string (nullable = true)
// | | |-- more2: string (nullable = true)
// | | |-- more3: string (nullable = true)

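// After explode, moreData is a struct column, so dot syntax reaches its fields.
val columns = Seq(
  $"moreData.more1", $"moreData.more2", $"moreData.more3",
  $"data.field1", $"data.field2")

// explode emits one row per array element; data is repeated on each row.
val aRDD = df.withColumn("moreData", explode($"moreData"))
  .select(columns: _*)
  .rdd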

aRDD.collect
// Array[org.apache.spark.sql.Row] = Array([a,b,c,foo,bar], [d,e,f,foo,bar])

Depending on your requirements, you can then use map to extract the values from each Row:

import org.apache.spark.sql.Row

aRDD.map { case Row(m1: String, m2: String, m3: String, f1: String, f2: String) =>
  (m1, m2, m3, f1, f2)
}
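
The flatMap route the question asks about also works if you switch to typed records. The following is only a sketch, assuming Spark 2.x or later (SparkSession and Dataset). The Record case class, the toTuples helper, the FlatMapSketch object, and the local session are hypothetical names introduced for illustration. With a DataFrame read from JSON, df.as[Record] should give the same typed view, provided the column names match the case class fields.

```scala
import org.apache.spark.sql.SparkSession

// Case classes mirror the schema; they must be top-level so Spark can derive encoders.
case class Data(field1: String, field2: String)
case class MoreData(more1: String, more2: String, more3: String)
// Hypothetical wrapper type for one input row.
case class Record(data: Data, moreData: Seq[MoreData])

object FlatMapSketch {
  // Pure helper: one output tuple per array element, with the data fields repeated.
  // A null array yields no tuples, which matches explode dropping such rows.
  def toTuples(r: Record): Seq[(String, String, String, String, String)] =
    Option(r.moreData).getOrElse(Seq.empty).map { m =>
      (m.more1, m.more2, m.more3, r.data.field1, r.data.field2)
    }

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; in spark-shell use the provided `spark`.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatMap-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(
      Record(Data("foo", "bar"), Seq(MoreData("a", "b", "c"), MoreData("d", "e", "f")))
    ).toDS()

    // RDD[(String, String, String, String, String)]
    val tuples = ds.rdd.flatMap(toTuples)

    tuples.collect().foreach(println)
    // prints (a,b,c,foo,bar) and (d,e,f,foo,bar)

    spark.stop()
  }
}
```

Compared with explode plus select, this version deserializes every row into JVM objects, so it is likely slower on large data. In exchange, the field types are checked at compile time and no Row pattern match is needed.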

See also: Querying Spark SQL DataFrame with complex types

This question, "scala - Nested JSON in Spark", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/34025528/
