
scala - Aggregation based on SUM: selecting specific row values in an Apache Spark DataFrame using Scala

Reposted · Author: 行者123 · Updated: 2023-12-02 20:23:34

I want to aggregate the dataset below into a SUM(QUANTITY) column and, based on that sum, select the related row values as described:

Input:

val df = Seq(
  ("Acc1", "111", "1111", "Sec1", -4000),
  ("Acc2", "222", "2222", "Sec1", 2000),
  ("Acc3", "333", "3333", "Sec1", 1000),
  ("Acc4", "444", "4444", "Sec1", -10000)
).toDF("ACCOUNT_NO", "LONG_IND", "SHORT_IND", "SECURITY_ID", "QUANTITY")

How can I aggregate based on SUM(QUANTITY) such that, in the final result:
(a) if the sum is negative, the row with the largest negative value (-10000 in this case) takes precedence;
(b) if the sum is positive, the row with the largest positive value (2000) takes precedence?

In this case the sum is negative: -4000 + 2000 + 1000 - 10000 = -11000, so case (a) applies to the dataset above, giving the result below:

Desired output after aggregation:
+-----------+----------+--------+---------+--------+
|SECURITY_ID|ACCOUNT_NO|LONG_IND|SHORT_IND|QUANTITY|
+-----------+----------+--------+---------+--------+
| SEC1| ACC4| 444| 4444| -11000|
+-----------+----------+--------+---------+--------+
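The selection rule itself can be sanity-checked on plain Scala collections, independent of Spark (an illustrative sketch with a hypothetical `Row` case class, not part of the question's code):

```scala
// Hypothetical Row case class mirroring the DataFrame columns.
case class Row(account: String, longInd: String, shortInd: String, qty: Long)

val rows = Seq(
  Row("Acc1", "111", "1111", -4000L),
  Row("Acc2", "222", "2222", 2000L),
  Row("Acc3", "333", "3333", 1000L),
  Row("Acc4", "444", "4444", -10000L)
)

// Net quantity across the group.
val total = rows.map(_.qty).sum // -11000

// Negative total -> take the most negative row; positive -> the most positive.
val pick = if (total < 0) rows.minBy(_.qty) else rows.maxBy(_.qty)
// pick is the Acc4 row; the final output reports it with QUANTITY = total.
```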

Attempted approach:
// This fails: SUM_QUANTITY is referenced inside the same agg() that defines it,
// so Spark raises an AnalysisException (the column does not yet exist on df).
val resultDF = df
  .groupBy("SECURITY_ID")
  .agg(
    max($"SECURITY_ID").as("SECURITY_ID"),
    max($"ACCOUNT_NO").as("ACCOUNT_NO"),
    max(when($"SUM_QUANTITY" > 0, $"LONG_IND")).as("LONG_IND"),
    max(when($"SUM_QUANTITY" < 0, $"SHORT_IND")).as("SHORT_IND"),
    sum($"QUANTITY").cast("Long").as("SUM_QUANTITY")
  )
  .toDF("SECURITY_ID", "ACCOUNT_NO", "LONG_IND", "SHORT_IND", "QUANTITY")

Is it possible to get this result using RANK somehow?

Best answer

Yes, you can combine window functions with row_number() to achieve what you want:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Unordered window over the whole group, for the running SUM.
val window     = Window.partitionBy($"SECURITY_ID")
// Ordered windows to rank rows from most negative / most positive quantity.
val windowAsc  = Window.partitionBy($"SECURITY_ID").orderBy($"QUANTITY".asc)
val windowDesc = Window.partitionBy($"SECURITY_ID").orderBy($"QUANTITY".desc)

df
  .withColumn("sum", sum($"QUANTITY").over(window))
  .withColumn("rnb", when($"sum" < 0, row_number().over(windowAsc))
                       .otherwise(row_number().over(windowDesc)))
  .where($"rnb" === 1)
  .withColumn("QUANTITY", $"sum")
  .drop("rnb", "sum")
  .show()

which gives:
+----------+--------+---------+-----------+--------+
|ACCOUNT_NO|LONG_IND|SHORT_IND|SECURITY_ID|QUANTITY|
+----------+--------+---------+-----------+--------+
| Acc4| 444| 4444| Sec1| -11000|
+----------+--------+---------+-----------+--------+
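If you would rather avoid window functions, the same result can be sketched with a single groupBy, exploiting the fact that Spark compares structs lexicographically by their first field (untested sketch; `QUANTITY` is placed first in the struct so max/min pick the most positive/most negative row):

```scala
import org.apache.spark.sql.functions._

val resultDF = df
  .groupBy($"SECURITY_ID")
  .agg(
    sum($"QUANTITY").as("SUM_QUANTITY"),
    // Struct comparison is driven by QUANTITY, the first field.
    max(struct($"QUANTITY", $"ACCOUNT_NO", $"LONG_IND", $"SHORT_IND")).as("max_row"),
    min(struct($"QUANTITY", $"ACCOUNT_NO", $"LONG_IND", $"SHORT_IND")).as("min_row")
  )
  .withColumn("row", when($"SUM_QUANTITY" < 0, $"min_row").otherwise($"max_row"))
  .select(
    $"SECURITY_ID",
    $"row.ACCOUNT_NO",
    $"row.LONG_IND",
    $"row.SHORT_IND",
    $"SUM_QUANTITY".as("QUANTITY")
  )
```

This trades the two sorts implied by the ordered windows for a single hash aggregation, which can matter on large groups.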

About "scala - Aggregation based on SUM: selecting specific row values in an Apache Spark DataFrame using Scala", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58981325/
