gpt4 book ai didi

sql - 如何在 hive 中通过 collect_set() 操作使用 order by

转载 作者:行者123 更新时间:2023-12-02 06:29:09 25 4
gpt4 key购买 nike

在表 1 中,我有 customer_id、item_id 和 item_rank(根据一些销售额的项目排名)。我想为每个 customer_id 收集一个项目列表,并根据 item_rank 排列它们。

Customer_id  item_id rank_item
23 2 3
23 2 3
23 4 2
25 5 1
25 4 2

我期望的输出是
Customer_id    item_list
23 4,2
25 5,4

我使用的代码是
 SELECT
customer_id,
concat_ws(',',collect_list (string(item_id))) AS item_list
FROM
table1
GROUP BY
customer_id
ORDER BY
item_rank

最佳答案

可以使用子查询得到(customer_id, item_id, item_rank)的结果集,按item_rank排序,然后在外层查询中使用collect_set

查询

WITH table1 AS (
SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
SELECT 23 AS customer_id, 4 AS item_id, 2 AS item_rank UNION ALL
SELECT 25 AS customer_id, 5 AS item_id, 1 AS item_rank UNION ALL
SELECT 25 AS customer_id, 4 AS item_id, 2 AS item_rank
)
SELECT
subquery.customer_id,
collect_set(subquery.item_id) AS item_id_set
FROM (
SELECT
table1.customer_id,
table1.item_id,
table1.item_rank
FROM table1
DISTRIBUTE BY
table1.customer_id
SORT BY
table1.customer_id,
table1.item_rank
) subquery
GROUP BY
subquery.customer_id
;

结果
    customer_id item_id_set
0 23 [4,2]
1 25 [5,4]

子查询使用 DISTRIBUTE BY 来保证特定 customer_id 的所有行都路由到同一个 reducer。然后使用 SORT BY 在每个 reducer 中按 customer_iditem_rank 排序。我希望这足以满足要求,因为我没有注意到对最终结果集进行总排序的要求。 (如果需要通过 customer_id 进行总排序,那么我认为查询必须使用 ORDER BY ,这会导致执行速度变慢。)

在内部, collect_set UDAF 使用 Java LinkedHashSet ,它是一个顺序保留集合,因此子查询中使用的相同排序顺序将在外部查询的集合中维护。这在 Hive 代码库中可见:

https://github.com/apache/hive/blob/release-2.0.0/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java#L93

关于sql - 如何在 hive 中通过 collect_set() 操作使用 order by,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/45092576/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com