
Scala Spark: Sum all columns across all rows


I can do this easily with

df.groupBy().sum()

but I'm not sure whether the groupBy() adds extra performance overhead, or whether it is simply bad style. I've also seen it done as
df.agg( ("col1", "sum"), ("col2", "sum"), ("col3", "sum"))

which skips the (in my opinion unnecessary) groupBy, but has its own ugliness. What is the right way to do this? Is there any internal difference between using .groupBy().<aggOp>() and using .agg?

Best Answer

If you look at the physical plan for the two queries, Spark internally generates the same plan, so we can use either of them!

I would go with df.groupBy().sum(), since it is convenient and we don't need to spell out all the column names.

Example:

val df=Seq((1,2,3),(4,5,6)).toDF("id","j","k")

scala> df.groupBy().sum().explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(id#7 as bigint)), sum(cast(j#8 as bigint)), sum(cast(k#9 as bigint))])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(id#7 as bigint)), partial_sum(cast(j#8 as bigint)), partial_sum(cast(k#9 as bigint))])
      +- LocalTableScan [id#7, j#8, k#9]

scala> df.agg(sum("id"),sum("j"),sum("k")).explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(id#7 as bigint)), sum(cast(j#8 as bigint)), sum(cast(k#9 as bigint))])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(id#7 as bigint)), partial_sum(cast(j#8 as bigint)), partial_sum(cast(k#9 as bigint))])
      +- LocalTableScan [id#7, j#8, k#9]

Regarding Scala Spark: Sum all columns across all rows, there is a similar question on Stack Overflow: https://stackoverflow.com/questions/60456585/
