
arrays - Reordering StructType fields and a nested array of structs

Reposted. Author: 行者123. Updated: 2023-12-01 09:28:25

I have a DataFrame with the following schema:

root
|-- col2: integer (nullable = true)
|-- col1: integer (nullable = true)
|-- structCol3: struct (nullable = true)
| |-- structField2: boolean (nullable = true)
| |-- structField1: string (nullable = true)
|-- structCol4: struct (nullable = true)
| |-- nestedArray: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- elem3: double (nullable = true)
| | | |-- elem2: string (nullable = true)
| | | |-- elem1: string (nullable = true)
| |-- structField2: integer (nullable = true)

Desired schema:

root
|-- col1: integer (nullable = true)
|-- col2: integer (nullable = true)
|-- structCol3: struct (nullable = true)
| |-- structField1: string (nullable = true)
| |-- structField2: boolean (nullable = true)
|-- structCol4: struct (nullable = true)
| |-- nestedArray: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- elem1: string (nullable = true)
| | | |-- elem2: string (nullable = true)
| | | |-- elem3: double (nullable = true)
| |-- structField2: integer (nullable = true)

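So far I have managed to reorder the top-level columns and the fields inside the structs, like this:

from pyspark.sql.functions import col, struct

dfParquetOutput = df.select(
    "col1",
    "col2",
    # Rebuild each struct with its fields listed in the desired order.
    struct(
        col("structCol3.structField1"),
        col("structCol3.structField2")
    ).alias("structCol3"),
    struct(
        col("structCol4.nestedArray"),
        col("structCol4.structField2")
    ).alias("structCol4")
)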

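Unfortunately, I am struggling to find a way to reorder the fields of the StructType elements inside the array. I tried using a udf, but without success.

Is there a simple way to reorder the fields of a struct inside an array?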

Best answer

You can't really avoid a udf (or an RDD) here. If you define the data as

from pyspark.sql.functions import udf, struct, col
from collections import namedtuple

Outer = namedtuple("Outer", ["structCol4"])
Inner = namedtuple("Inner", ["nestedArray", "structField2"])
Element = namedtuple("Element", ["col3", "col2", "col1"])

df = spark.createDataFrame([Outer(Inner([Element("3", "2", "1")], 1))])

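you can do

@udf("array<struct<col1: string, col2: string, col3: string>>")
def reorder(arr):
    # Each element unpacks in its current field order (col3, col2, col1)
    # and is returned as a tuple in the order declared in the return type.
    return [(col1, col2, col3) for col3, col2, col1 in arr]

result = df.withColumn(
    "structCol4",
    struct(
        reorder("structCol4.nestedArray").alias("nestedArray"),
        col("structCol4.structField2")
    )
)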

result.printSchema()
# root
# |-- structCol4: struct (nullable = false)
# | |-- nestedArray: array (nullable = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- col1: string (nullable = true)
# | | | |-- col2: string (nullable = true)
# | | | |-- col3: string (nullable = true)
# | |-- structField2: long (nullable = true)
#


result.show()
# +----------------+
# | structCol4|
# +----------------+
# |[[[1, 2, 3]], 1]|
# +----------------+
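The udf above passes every array through the Python worker. On Spark 2.4 or later (an assumption about your environment, not something the answer states), the SQL higher-order function `transform` together with `named_struct` can express the same reordering without a udf. The sketch below only builds the expression string; `reorder_expr` is a hypothetical helper name, and the commented lines show how the string might be passed to `expr`.

```python
def reorder_expr(array_path, field_order, var="x"):
    """Build a Spark SQL expression string that rebuilds each array element
    as a struct whose fields appear in `field_order`."""
    # named_struct takes alternating arguments: 'name', value, 'name', value, ...
    pairs = ", ".join(f"'{name}', {var}.{name}" for name in field_order)
    return f"transform({array_path}, {var} -> named_struct({pairs}))"


expression = reorder_expr("structCol4.nestedArray", ["col1", "col2", "col3"])
print(expression)

# Hypothetical use, assuming Spark 2.4+ and the `df` defined in the answer:
# from pyspark.sql.functions import col, expr, struct
# result = df.withColumn(
#     "structCol4",
#     struct(expr(expression).alias("nestedArray"),
#            col("structCol4.structField2")))
```

Keeping the field order in a plain list makes it easy to reuse the helper for other arrays of structs. Field names are interpolated as-is, so names that need quoting in SQL would require extra handling.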

With a deeply nested schema you would have to rebuild the complete tree inside the udf, but that is not needed here.
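To illustrate what rebuilding the complete tree could look like, here is a minimal pure-Python sketch that recursively reorders keys alphabetically. It uses plain dicts and lists as stand-ins for Spark rows and arrays; `sort_fields` is a hypothetical helper, and wiring it into a udf (for example by converting rows with `Row.asDict(recursive=True)` and declaring a matching return type) is left as an assumption.

```python
def sort_fields(value):
    """Recursively rebuild `value` so that every dict has its keys in
    alphabetical order; lists are processed element by element."""
    if isinstance(value, dict):
        return {key: sort_fields(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [sort_fields(item) for item in value]
    # Scalars (strings, numbers, None) are returned unchanged.
    return value


# Stand-in for one structCol4 value from the question's schema.
record = {
    "structField2": 1,
    "nestedArray": [{"elem3": 3.0, "elem2": "2", "elem1": "1"}],
}
reordered = sort_fields(record)
print(list(reordered))                    # ['nestedArray', 'structField2']
print(list(reordered["nestedArray"][0]))  # ['elem1', 'elem2', 'elem3']
```

Alphabetical order happens to match the desired schema in the question; a different target order would need an explicit list of field names per level.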

Source: a similar question on Stack Overflow about reordering StructType fields and nested arrays of structs: https://stackoverflow.com/questions/50152103/
