
python - PySpark - Converting a column of lists into rows

Reposted · Author: 太空狗 · Updated: 2023-10-30 02:37:51

I have a PySpark dataframe. I need to do a groupby and then aggregate certain columns into a list, so that I can apply a UDF to the dataframe.

For example, I create a dataframe and then group it by person:

import pyspark.sql.functions as F

# sample data, reconstructed from the output shown below
a = [("Bob", 85.8, "Food", "2017-09-13"),
     ("Bob", 7.8, "Household", "2017-09-13"),
     ("Bob", 6.52, "Food", "2017-06-13")]

df = spark.createDataFrame(a, ["Person", "Amount", "Budget", "Date"])
df = df.groupby("Person").agg(F.collect_list(F.struct("Amount", "Budget", "Date")).alias("data"))
df.show(truncate=False)
+------+----------------------------------------------------------------------------+
|Person|data |
+------+----------------------------------------------------------------------------+
|Bob |[[85.8,Food,2017-09-13], [7.8,Household,2017-09-13], [6.52,Food,2017-06-13]]|
+------+----------------------------------------------------------------------------+

I have omitted the UDF, but the dataframe it produces is shown below.

+------+--------------------------------------------------------------+
|Person|res |
+------+--------------------------------------------------------------+
|Bob |[[562,Food,June,1], [380,Household,Sept,4], [880,Food,Sept,2]]|
+------+--------------------------------------------------------------+

I need to convert the resulting dataframe into rows, where each element of the list becomes a new row with its own columns, as shown below.

+------+------+---------+-----+-------+
|Person|Amount|Budget   |Month|Cluster|
+------+------+---------+-----+-------+
|Bob   |562   |Food     |June |1      |
|Bob   |380   |Household|Sept |4      |
|Bob   |880   |Food     |Sept |2      |
+------+------+---------+-----+-------+
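Conceptually, the transform I want is: one output row per list element, with each struct field pulled out into its own column. A plain-Python sketch of that logic (the sample data is hypothetical, mirroring the tables above):

```python
# Each input record has a "res" list; every element of that list
# becomes its own output row ("explode"), and each positional field
# is mapped to a named column ("getItem").
grouped = [
    {"Person": "Bob",
     "res": [(562, "Food", "June", 1),
             (380, "Household", "Sept", 4),
             (880, "Food", "Sept", 2)]},
]

fields = ["Amount", "Budget", "Month", "Cluster"]

rows = []
for rec in grouped:
    for item in rec["res"]:            # one output row per list element
        row = {"Person": rec["Person"]}
        row.update(zip(fields, item))  # pull each struct field into a column
        rows.append(row)

for r in rows:
    print(r)
```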

Best answer

You can use explode and getItem as follows:

# starting from this form:
+------+--------------------------------------------------------------+
|Person|res |
+------+--------------------------------------------------------------+
|Bob |[[562,Food,June,1], [380,Household,Sept,4], [880,Food,Sept,2]]|
+------+--------------------------------------------------------------+
import pyspark.sql.functions as F

# explode res to have one row for each item in res
exploded_df = df.select("*", F.explode("res").alias("exploded_data"))
exploded_df.show(truncate=False)

# then use getItem to create separate columns
exploded_df = exploded_df.withColumn(
    "Amount",
    F.col("exploded_data").getItem("Amount")  # get by field name, or by index, e.g. getItem(0)
)

exploded_df = exploded_df.withColumn(
    "Budget",
    F.col("exploded_data").getItem("Budget")
)

exploded_df = exploded_df.withColumn(
    "Month",
    F.col("exploded_data").getItem("Month")
)

exploded_df = exploded_df.withColumn(
    "Cluster",
    F.col("exploded_data").getItem("Cluster")
)

exploded_df.select("Person", "Amount", "Budget", "Month", "Cluster").show(10, False)

+------+------+---------+-----+-------+
|Person|Amount|Budget   |Month|Cluster|
+------+------+---------+-----+-------+
|Bob   |562   |Food     |June |1      |
|Bob   |380   |Household|Sept |4      |
|Bob   |880   |Food     |Sept |2      |
+------+------+---------+-----+-------+

You can then drop the columns you don't need. Hope this helps, good luck!

Regarding "python - PySpark - Converting a column of lists into rows", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48822381/
