
python - Pyspark: Calculate the sum of two corresponding columns based on conditions on two columns in two RDDs

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:32:30

I have two RDDs with the same columns:
rdd1:

+-----------------+
|mid|uid|frequency|
+-----------------+
| m1| u1|        1|
| m1| u2|        1|
| m2| u1|        2|
+-----------------+

rdd2:

+-----------------+
|mid|uid|frequency|
+-----------------+
| m1| u1|       10|
| m2| u1|       98|
| m3| u2|       21|
+-----------------+

I want to calculate the sum of the frequencies for each combination of mid and uid. The result should be something like:

+-----------------+
|mid|uid|frequency|
+-----------------+
| m1| u1|       11|
| m2| u1|      100|
| m3| u2|       21|
+-----------------+

Thanks in advance.

EDIT: I also arrived at a solution this way (using map-reduce):

from pyspark.sql.functions import col

data1 = [("m1","u1",1),("m1","u2",1),("m2","u1",2)]
data2 = [("m1","u1",10),("m2","u1",98),("m3","u2",21)]
df1 = sqlContext.createDataFrame(data1,['mid','uid','frequency'])
df2 = sqlContext.createDataFrame(data2,['mid','uid','frequency'])

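# Stack the rows of both DataFrames (unionAll keeps duplicates).
df3 = df1.unionAll(df2)
# Key each row by (mid, uid) and sum the frequencies per key.
# Going through .rdd is needed on Spark 2.0+, where DataFrame no longer has map().
df4 = df3.rdd.map(lambda row: ((row['mid'], row['uid']), int(row['frequency']))) \
    .reduceByKey(lambda a, b: a + b)

# Flatten ((mid, uid), total) back into three columns.
p = df4.map(lambda kv: (kv[0][0], kv[0][1], kv[1])).toDF()

# toDF() without names yields _1, _2, _3, so restore the original column names.
p = p.select(col("_1").alias("mid"),
             col("_2").alias("uid"),
             col("_3").alias("frequency"))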

p.show()

Output:

+---+---+---------+
|mid|uid|frequency|
+---+---+---------+
| m2| u1|      100|
| m1| u1|       11|
| m1| u2|        1|
| m3| u2|       21|
+---+---+---------+
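To check what the union followed by a per-key sum should produce without starting Spark, the same aggregation can be sketched in plain Python. This is only an illustration on the sample rows from the question: `sum_by_key` is a hypothetical helper, not a PySpark function, and list concatenation stands in for `unionAll`.

```python
from collections import defaultdict

# Same sample rows as in the question: (mid, uid, frequency)
data1 = [("m1", "u1", 1), ("m1", "u2", 1), ("m2", "u1", 2)]
data2 = [("m1", "u1", 10), ("m2", "u1", 98), ("m3", "u2", 21)]


def sum_by_key(rows):
    """Sum frequency per (mid, uid), mimicking map + reduceByKey(a + b)."""
    totals = defaultdict(int)
    for mid, uid, frequency in rows:
        totals[(mid, uid)] += int(frequency)
    return dict(totals)


# data1 + data2 plays the role of unionAll: rows are concatenated, not deduplicated.
totals = sum_by_key(data1 + data2)

for (mid, uid), frequency in sorted(totals.items()):
    print(mid, uid, frequency)
```

Every key that appears in either input ends up in the result, which is why the (m1, u2) row appears in the output above even though the expected result in the question leaves it out.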

Best answer

You only need to group by mid and uid and then sum:

data1 = [("m1","u1",1),("m1","u2",1),("m2","u1",2)]
data2 = [("m1","u1",10),("m2","u1",98),("m3","u2",21)]
df1 = sqlContext.createDataFrame(data1,['mid','uid','frequency'])
df2 = sqlContext.createDataFrame(data2,['mid','uid','frequency'])

df3 = df1.unionAll(df2)

df4 = df3.groupBy(df3.mid,df3.uid).sum() \
.withColumnRenamed("sum(frequency)","frequency")

df4.show()

# +---+---+---------+
# |mid|uid|frequency|
# +---+---+---------+
# | m1| u1|       11|
# | m1| u2|        1|
# | m2| u1|      100|
# | m3| u2|       21|
# +---+---+---------+

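This post on "python - Pyspark: Calculate the sum of two corresponding columns based on conditions on two columns in two RDDs" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/36661800/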
