gpt4 book ai didi

arrays - HiveQL:如何在数组 的列中查找重复元素

转载 作者:可可西里 更新时间:2023-11-01 15:07:56 25 4
gpt4 key购买 nike

我正在 HiveQL 中创建一个表,一列 duplicate_set 应该是包含另一列 list 列表中重复元素集的数组。例如给定一个表

+-----------+-------------------------+----------------------+
| id | list | duplicate_set |
+-----------+-------------------------+----------------------+
| 1 | ["1","2","2","3","3"] | ["2","3"] |
+-----------+-------------------------+----------------------+
| 2 | ["2","2","5","6"] | ["2"] |
+-----------+-------------------------+----------------------+
| 3 | ["2","4","5","6"] | [] |
...

提取重复元素并将它们放入集合中的最佳方法是什么?它有任何现有的 UDF 吗?谢谢。

最佳答案

您可以展开数组,计算row_number,然后将重复的元素(row_number>1)聚合到集合中:

with initial_data as (
select 1 id ,array("1","2","2","3","3") list union all
select 2 ,array("2","2","5","6") list union all
select 3 ,array("2","4","5","6")
)

select s.id, s.list, collect_set(case when s.rn>1 then x end) duplicate_set
from(
select s.id, s.list, l.x, row_number() over(partition by id, l.x) as rn
from initial_data s
lateral view explode(list) l as x --array element x
) s
group by s.id, s.list;

结果:

id      list                    duplicate_set
1 ["1","2","2","3","3"] ["2","3"]
2 ["2","2","5","6"] ["2"]
3 ["2","4","5","6"] []

关于arrays - HiveQL:如何在数组 <string> 的列中查找重复元素,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/57402070/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com