
scala - How to combine (join) information across an Array[DataFrame]

Reposted, author: 行者123, updated: 2023-12-01 13:47:49

I have an Array[DataFrame], and for each row of each DataFrame I want to check whether any column's value changes across the DataFrames. Say the three DataFrames each have a first row like:

(0,1.0,0.4,0.1)
(0,3.0,0.2,0.1)
(0,5.0,0.4,0.1)

The first column is the ID, and my ideal output for this ID would be:

 (0, 1, 1, 0)

meaning that the second and third columns changed while the fourth did not. I'm attaching some data here to reproduce my setup:

// run in a spark-shell session, where sc and the toDF implicits are in scope
val rdd = sc.parallelize(Array((0,1.0,0.4,0.1),
(1,0.9,0.3,0.3),
(2,0.2,0.9,0.2),
(3,0.9,0.2,0.2),
(4,0.3,0.5,0.5)))
val rdd2 = sc.parallelize(Array((0,3.0,0.2,0.1),
(1,0.9,0.3,0.3),
(2,0.2,0.5,0.2),
(3,0.8,0.1,0.1),
(4,0.3,0.5,0.5)))
val rdd3 = sc.parallelize(Array((0,5.0,0.4,0.1),
(1,0.5,0.3,0.3),
(2,0.3,0.3,0.5),
(3,0.3,0.3,0.1),
(4,0.3,0.5,0.5)))
val df = rdd.toDF("id", "prop1", "prop2", "prop3")
val df2 = rdd2.toDF("id", "prop1", "prop2", "prop3")
val df3 = rdd3.toDF("id", "prop1", "prop2", "prop3")
val result:Array[DataFrame] = new Array[DataFrame](3)
result.update(0, df)
result.update(1,df2)
result.update(2,df3)

How can I map over the array and get this output?

Best Answer

You can use countDistinct together with groupBy:

import org.apache.spark.sql.functions.{countDistinct}

val exprs = Seq("prop1", "prop2", "prop3")
.map(c => (countDistinct(c) > 1).cast("integer").alias(c))

val combined = result.reduce(_ unionAll _) // in Spark 2.0+ use union instead of unionAll

val aggregatedViaGroupBy = combined
.groupBy($"id")
.agg(exprs.head, exprs.tail: _*)

aggregatedViaGroupBy.show
// +---+-----+-----+-----+
// | id|prop1|prop2|prop3|
// +---+-----+-----+-----+
// | 0| 1| 1| 0|
// | 1| 1| 0| 0|
// | 2| 1| 1| 1|
// | 3| 1| 1| 1|
// | 4| 0| 0| 0|
// +---+-----+-----+-----+
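To see what `(countDistinct(c) > 1).cast("integer")` computes per id, the same logic can be sketched with plain Scala collections (no Spark session needed); the data below mirrors the three example DataFrames:

```scala
// Per id, flag a column with 1 if it takes more than one distinct
// value across the three source rows, else 0 — the collection
// analogue of countDistinct(c) > 1 after grouping by id.
object ChangeFlags {
  val rows: Seq[(Int, Double, Double, Double)] = Seq(
    (0, 1.0, 0.4, 0.1), (1, 0.9, 0.3, 0.3), (2, 0.2, 0.9, 0.2), (3, 0.9, 0.2, 0.2), (4, 0.3, 0.5, 0.5),
    (0, 3.0, 0.2, 0.1), (1, 0.9, 0.3, 0.3), (2, 0.2, 0.5, 0.2), (3, 0.8, 0.1, 0.1), (4, 0.3, 0.5, 0.5),
    (0, 5.0, 0.4, 0.1), (1, 0.5, 0.3, 0.3), (2, 0.3, 0.3, 0.5), (3, 0.3, 0.3, 0.1), (4, 0.3, 0.5, 0.5)
  )

  // 1 if the column changed across the group, 0 otherwise
  private def flag(values: Seq[Double]): Int =
    if (values.distinct.size > 1) 1 else 0

  val changed: Map[Int, (Int, Int, Int)] =
    rows.groupBy(_._1).map { case (id, group) =>
      id -> (flag(group.map(_._2)), flag(group.map(_._3)), flag(group.map(_._4)))
    }
}

// ChangeFlags.changed(0) == (1, 1, 0) — prop1 and prop2 vary for id 0, prop3 does not
```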

Regarding "scala - How to combine (join) information across an Array[DataFrame]", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/34501464/
