scala - Exploding a nested struct in a Spark dataframe

Reposted. Author: 行者123. Updated: 2023-12-03 04:52:48

I'm working through a Databricks example. The schema of the dataframe looks like this:

> parquetDF.printSchema
root
 |-- department: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- salary: integer (nullable = true)

In the example, they show how to explode the employees column into 4 additional columns:

val explodeDF = parquetDF.explode($"employees") {
  case Row(employee: Seq[Row]) => employee.map { employee =>
    val firstName = employee(0).asInstanceOf[String]
    val lastName = employee(1).asInstanceOf[String]
    val email = employee(2).asInstanceOf[String]
    val salary = employee(3).asInstanceOf[Int]
    Employee(firstName, lastName, email, salary)
  }
}.cache()
display(explodeDF)

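How would I do something similar with the department column (i.e. add two additional columns called "id" and "name" to the dataframe)? The approach is not quite the same, and so far I can only figure out how to create an entirely new dataframe using: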

val explodeDF = parquetDF.select("department.id","department.name")
display(explodeDF)

If I try:

val explodeDF = parquetDF.explode($"department") {
  case Row(dept: Seq[String]) => dept.map{dept =>
    val id = dept(0)
    val name = dept(1)
  }
}.cache()
display(explodeDF)

I get this warning and error:

<console>:38: warning: non-variable type argument String in type pattern Seq[String] is unchecked since it is eliminated by erasure
case Row(dept: Seq[String]) => dept.map{dept =>
^
<console>:37: error: inferred type arguments [Unit] do not conform to method explode's type parameter bounds [A <: Product]
val explodeDF = parquetDF.explode($"department") {
^

Best answer

In my opinion, the most elegant solution is to star-expand the struct with the select operator, as shown below:

var explodedDf2 = explodedDf.select("department.*","*")

https://docs.databricks.com/spark/latest/spark-sql/complex-types.html
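To make the star expansion concrete, here is a minimal, self-contained sketch. It is not the original notebook: the sample JSON record, the local SparkSession, and the names `df`, `expanded` and `withCols` are assumptions made for illustration, and the answer's `explodedDf` is presumably whatever dataframe still carries the `department` struct. Because the sample is read from JSON, the inferred types differ slightly from the schema in the question (for example, `salary` is inferred as long).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation; in a Databricks notebook
// or spark-shell, `spark` is already provided.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("star-expand-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample record shaped like the schema in the question.
val json = Seq(
  """{"department": {"id": "123456", "name": "Computer Science"},
    | "employees": [{"firstName": "Ada", "lastName": "Lovelace",
    |                "email": "ada@example.com", "salary": 100000}]}""".stripMargin.replace("\n", " ")
)
val df = spark.read.json(json.toDS)

// Star expansion: "department.*" pulls every field of the struct up to a
// top-level column, and "*" keeps all of the original columns after them.
val expanded = df.select("department.*", "*")
expanded.printSchema()

// Alternative: add the struct fields one at a time with withColumn, which
// appends them after the existing columns.
val withCols = df
  .withColumn("id", col("department.id"))
  .withColumn("name", col("department.name"))
withCols.printSchema()
```

If the original `department` struct is no longer needed after expansion, listing the remaining columns explicitly (for example `df.select("department.*", "employees")`) avoids carrying the same data twice.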

Regarding scala - Exploding a nested struct in a Spark dataframe, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39275816/
