
python - Divide a PySpark DataFrame column-wise by another PySpark DataFrame where IDs match


I have a PySpark DataFrame, df1, that looks like:

CustomerID  CustomerValue
        12           0.17
        14           0.15
        14           0.25
        17           0.50
        17           0.01
        17           0.35

I have a second PySpark DataFrame, df2, which is df1 grouped by CustomerID and aggregated with the sum function. It looks like this:

CustomerID  CustomerValueSum
        12              0.17
        14              0.40
        17              0.86

I would like to add a third column to df1 that is df1['CustomerValue'] divided by df2['CustomerValueSum'] for the matching CustomerID. It would look like:

CustomerID  CustomerValue  NormalizedCustomerValue
        12           0.17                     1.00
        14           0.15                     0.38
        14           0.25                     0.62
        17           0.50                     0.58
        17           0.01                     0.01
        17           0.35                     0.41

In other words, I'm trying to convert this Python/Pandas code to PySpark:

normalized_list = []
for idx, row in df1.iterrows():
    normalized_list.append(
        row.CustomerValue / df2[df2.CustomerID == row.CustomerID].CustomerValueSum
    )
df1['NormalizedCustomerValue'] = [val.values[0] for val in normalized_list]
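(As an aside, the row-by-row Pandas loop above can be vectorized with groupby().transform, which avoids iterrows entirely; a sketch, not part of the original question:)

```python
import pandas as pd

df1 = pd.DataFrame({
    "CustomerID": [12, 14, 14, 17, 17, 17],
    "CustomerValue": [0.17, 0.15, 0.25, 0.50, 0.01, 0.35],
})

# transform("sum") broadcasts each group's sum back onto that group's rows,
# so the division aligns element-wise with df1
group_sum = df1.groupby("CustomerID")["CustomerValue"].transform("sum")
df1["NormalizedCustomerValue"] = df1["CustomerValue"] / group_sum
```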

How can I do this?

Best Answer

Code:

import pyspark.sql.functions as F

df1 = (
    df1
    .join(df2, "CustomerID")  # match each row with its group sum
    .withColumn("NormalizedCustomerValue", F.col("CustomerValue") / F.col("CustomerValueSum"))
    .drop("CustomerValueSum")  # drop the helper column after dividing
)

Output:

df1.show()

+----------+-------------+-----------------------+
|CustomerID|CustomerValue|NormalizedCustomerValue|
+----------+-------------+-----------------------+
| 17| 0.5| 0.5813953488372093|
| 17| 0.01| 0.011627906976744186|
| 17| 0.35| 0.4069767441860465|
| 12| 0.17| 1.0|
| 14| 0.15| 0.37499999999999994|
| 14| 0.25| 0.625|
+----------+-------------+-----------------------+

Regarding "python - Divide a PySpark DataFrame column-wise by another PySpark DataFrame where IDs match", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43287451/
