
pyspark - Flattening nested structs in a PySpark array

Reposted. Author: 行者123. Updated: 2023-12-05 06:38:26

Given a schema like this:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisors: struct
 |    |    |    |-- advisor1: string
 |    |    |    |-- advisor2: string

How can I get a schema like this:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisor1: string
 |    |    |-- advisor2: string

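Currently, I explode the array, flatten the struct by selecting advisors.*, then group by first_name, last_name and rebuild the array with collect_list. I'm hoping there is a cleaner/shorter way to do this. Right now it also involves a lot of painful renaming of fields and other details that I don't want to get into here. Thanks!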

Best answer

You can use a udf to change the data type of the nested column in the DataFrame. Assume you have read the DataFrame as df1:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

def foo(data):
    # Turn each degree into a flat (school, advisor1, advisor2) tuple.
    return [
        (
            x["school"],
            x["advisors"]["advisor1"],
            x["advisors"]["advisor2"],
        )
        for x in data
    ]

struct = ArrayType(
    StructType([
        StructField("school", StringType()),
        StructField("advisor1", StringType()),
        StructField("advisor2", StringType())
    ])
)
udf_foo = udf(foo, struct)

df2 = df1.withColumn("degrees", udf_foo("degrees"))
df2.printSchema()

Output:

root
 |-- degrees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- school: string (nullable = true)
 |    |    |-- advisor1: string (nullable = true)
 |    |    |-- advisor2: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
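Since every level in the printed schema is nullable, a function like foo will raise if degrees, one of its elements, or advisors is null. Below is a minimal sketch of a null-tolerant variant that can be checked without a Spark session. The name `flatten_degrees` and the sample values are made up for illustration, and plain dicts stand in for pyspark Row objects, which support the same `x["field"]` lookup.

```python
def flatten_degrees(degrees):
    """Flatten each degree to (school, advisor1, advisor2), passing nulls through."""
    if degrees is None:
        return None  # a null array stays null
    flat = []
    for degree in degrees:
        if degree is None:
            flat.append(None)  # a null element stays null
            continue
        advisors = degree["advisors"]
        flat.append((
            degree["school"],
            advisors["advisor1"] if advisors is not None else None,
            advisors["advisor2"] if advisors is not None else None,
        ))
    return flat


# Assumption: dicts stand in for Row objects; the values are invented.
sample = [
    {"school": "School A", "advisors": {"advisor1": "X", "advisor2": "Y"}},
    {"school": "School B", "advisors": None},
]
print(flatten_degrees(sample))
# [('School A', 'X', 'Y'), ('School B', None, None)]
print(flatten_degrees(None))
# None
```

If this behaves as wanted, it should be usable in place of foo in the udf call, with the same ArrayType return schema.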

For more on pyspark - flattening nested structs in a PySpark array, see the similar question on Stack Overflow: https://stackoverflow.com/questions/46178325/
