
scala - Removing duplicate array structs by the last item in the array struct in a Spark Dataframe


So my table looks like this:

customer_1|place|customer_2|item          |count
------------------------------------------------
a         |NY   |b         |(2010,304,310)|34
a         |NY   |b         |(2024,201,310)|21
a         |NY   |b         |(2010,304,312)|76
c         |NY   |x         |(2010,304,310)|11
a         |NY   |b         |(453,131,235) |10

I tried the following, but it does not eliminate the duplicates, because the original array is still there (as it should be; I need it for the final result).

// attempt: wrap the last array item, the array itself, and the count in a
// struct, then take the max struct per (customer_1, place, customer_2) group
val df = df_one
  .withColumn("vs", struct(col("item").getItem(size(col("item")) - 1), col("item"), col("count")))
  .groupBy(col("customer_1"), col("place"), col("customer_2"))
  .agg(max("vs").alias("vs"))
  .select(col("customer_1"), col("place"), col("customer_2"), col("vs.item"), col("vs.count"))

I want to group by the customer_1, place, and customer_2 columns and return only the array structs whose last item (-1) is unique, keeping the one with the highest count. Any ideas?

Expected output:

customer_1|place|customer_2|item          |count
------------------------------------------------
a         |NY   |b         |(2010,304,312)|76
a         |NY   |b         |(2010,304,310)|34
a         |NY   |b         |(453,131,235) |10
c         |NY   |x         |(2010,304,310)|11

Best Answer

Given the dataframe schema

root
 |-- customer_1: string (nullable = true)
 |-- place: string (nullable = true)
 |-- customer_2: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- count: string (nullable = true)

you can apply the concat function to create a temp column for identifying duplicate rows, as follows:

import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named `spark`

// key = grouping columns + last element of the item array; keep the first row per key
df.withColumn("temp", concat($"customer_1", $"place", $"customer_2", $"item"(size($"item") - 1)))
  .dropDuplicates("temp")
  .drop("temp")

You should get the following output:

+----------+-----+----------+----------------+-----+
|customer_1|place|customer_2|item            |count|
+----------+-----+----------+----------------+-----+
|a         |NY   |b         |[2010, 304, 312]|76   |
|c         |NY   |x         |[2010, 304, 310]|11   |
|a         |NY   |b         |[453, 131, 235] |10   |
|a         |NY   |b         |[2010, 304, 310]|34   |
+----------+-----+----------+----------------+-----+
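
Note that dropDuplicates keeps an arbitrary row within each duplicate group, so it does not by itself guarantee keeping the row with the highest count, which the question asks for. If that guarantee is needed, one way is to rank rows per group with a window function. A minimal sketch, assuming count should be compared numerically (the schema shows it as a string):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// rank rows within each (customer_1, place, customer_2, last item) group
// by numeric count, descending, then keep only the top row of each group
val byLastItem = Window
  .partitionBy($"customer_1", $"place", $"customer_2", $"item"(size($"item") - 1))
  .orderBy($"count".cast("int").desc)

df.withColumn("rn", row_number().over(byLastItem))
  .filter($"rn" === 1)
  .drop("rn")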

Struct

Given the dataframe schema

root
 |-- customer_1: string (nullable = true)
 |-- place: string (nullable = true)
 |-- customer_2: string (nullable = true)
 |-- item: struct (nullable = true)
 |    |-- _1: integer (nullable = false)
 |    |-- _2: integer (nullable = false)
 |    |-- _3: integer (nullable = false)
 |-- count: string (nullable = true)

we can still do the same as above, with one small change: take the third item from the struct directly, as

import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named `spark`

// same idea, reading the third field of the struct directly
df.withColumn("temp", concat($"customer_1", $"place", $"customer_2", $"item._3"))
  .dropDuplicates("temp")
  .drop("temp")

Hope the answer helps.

Regarding scala - removing duplicate array structs by the last item in the array struct in a Spark Dataframe, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/45458302/
