
scala - Get the distinct elements of an ArrayType column in a Spark DataFrame

Reposted. Author: 行者123. Last updated: 2023-12-01 00:41:09

I have a DataFrame with three columns named id, feat1, and feat2. feat1 and feat2 are arrays of strings:

id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]

I want to get the list of distinct elements in each feature column, so the output would be:
distinct_feat1, distinct_feat2
-----------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"], ["feat2_1","feat2_2","feat2_3"]

What is the best way to do this in Scala?

Best answer

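You can use collect_set to find the distinct values of each column after applying the explode function to that column, which un-nests the array elements in each cell. Assuming your DataFrame is named df: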

import org.apache.spark.sql.functions._

val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
  withColumn("feat2", explode(col("feat2"))).
  agg(collect_set("feat1").alias("distinct_feat1"),
      collect_set("feat2").alias("distinct_feat2"))

distinct_df.show
+--------------------+--------------------+
|      distinct_feat1|      distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+


distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
WrappedArray(, feat2_1, feat2_2, feat2_3)])
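
A caveat on the approach above: explode emits no rows for an empty or null array, so a row such as id 1 would be dropped entirely if its feat2 were truly empty, and chaining two explodes builds the cross product of both arrays for every row. The blank entry in distinct_feat2 suggests that cell actually held an empty string. As a minimal sketch of an alternative, assuming Spark 2.4 or later (where flatten and array_distinct are available), the arrays can be combined without exploding. The local SparkSession, the sample rows, and the names distinctDf, distinctFeat1, and distinctFeat2 are assumptions made for illustration, and the code is written for spark-shell or a Scala script:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_distinct, col, collect_list, flatten}

// Hypothetical local session, only for illustration (spark-shell already provides `spark`).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("distinct-array-elements")
  .getOrCreate()
import spark.implicits._

// Sample data modeled on the question; row 1 has a truly empty feat2 array.
val df = Seq(
  (1, Seq("feat1_1", "feat1_2", "feat1_3"), Seq.empty[String]),
  (2, Seq("feat1_2"), Seq("feat2_1", "feat2_2")),
  (3, Seq("feat1_4"), Seq("feat2_3"))
).toDF("id", "feat1", "feat2")

// collect_list gathers every array into an array of arrays,
// flatten merges them into a single array,
// and array_distinct removes the duplicates.
val distinctDf = df.agg(
  array_distinct(flatten(collect_list(col("feat1")))).alias("distinct_feat1"),
  array_distinct(flatten(collect_list(col("feat2")))).alias("distinct_feat2")
)

// Pull the single result row back to the driver as plain Scala sets.
val row = distinctDf.head()
val distinctFeat1 = row.getSeq[String](0).toSet
val distinctFeat2 = row.getSeq[String](1).toSet
```

The order of elements in the resulting arrays is not guaranteed, so it is safer to compare the results as sets than as sequences.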

Regarding scala - Get the distinct elements of an ArrayType column in a Spark DataFrame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37801889/
