gpt4 book ai didi

sql - 在SQL(Hive)中使用collect_list函数汇总用户序列

转载 作者:行者123 更新时间:2023-12-02 20:47:26 25 4
gpt4 key购买 nike

我有一个数据集,可以代表个人的使用方式,即他们必须输入值的形式。

表格可以按照“姓氏”以外的任何顺序完成,步骤1是。

时间戳用于暗示表单完成的顺序。

例如-用户12345678在2017-10-25的04:58:08开始填写表格,并依次完成表格。

用户12345679在同一天的05:00:02开始填写表单,但仅涉及步骤2

用户12345680在05:05:06开始填写表单,但分散了注意力,没有超出步骤1,但在完成之前又返回了两次

用户12345681从06:31:12开始填写表单,并从第1步开始,但随后随机填写该表单。

|    date  |     timestamp      | user_id  |  visit_id   |      event    |  event_seq  |
|--------------------------------------------------------------------------------------|
|2017-10-25| 2017-10-25 04:58:08| 12345678 | 1234 | firstname | 1 |
|2017-10-25| 2017-10-25 04:58:20| 12345678 | 1234 | lastname | 2 |
|2017-10-25| 2017-10-25 04:58:35| 12345678 | 1234 | dob | 3 |
|2017-10-25| 2017-10-25 04:58:40| 12345678 | 1234 | postcode | 4 |
|2017-10-25| 2017-10-25 04:58:40| 12345678 | 1234 | email | 5 |
|2017-10-25| 2017-10-25 05:00:02| 12345679 | 1235 | firstname | 1 |
|2017-10-25| 2017-10-25 05:00:10| 12345679 | 1235 | lastname | 2 |
|2017-10-25| 2017-10-25 05:05:06| 12345680 | 1236 | firstname | 1 |
|2017-10-25| 2017-10-25 05:30:24| 12345680 | 1236 | firstname | 1 |
|2017-10-25| 2017-10-25 06:17:24| 12345680 | 1236 | firstname | 1 |
|2017-10-25| 2017-10-25 06:20:30| 12345680 | 1236 | lastname | 2 |
|2017-10-25| 2017-10-25 06:20:45| 12345680 | 1236 | dob | 3 |
|2017-10-25| 2017-10-25 06:20:45| 12345680 | 1236 | postcode | 4 |
|2017-10-25| 2017-10-25 06:20:45| 12345680 | 1236 | email | 5 |
|2017-10-25| 2017-10-25 06:31:12| 12345681 | 1237 | firstname | 1 |
|2017-10-25| 2017-10-25 06:31:18| 12345681 | 1237 | email | 5 |
|2017-10-25| 2017-10-25 06:31:50| 12345681 | 1237 | lastname | 2 |
|2017-10-25| 2017-10-25 06:32:16| 12345681 | 1237 | postcode | 4 |
|2017-10-25| 2017-10-25 06:32:40| 12345681 | 1237 | dob | 3 |

我编写的代码如下,并引用了一个预先存在的表,其中CASE WHEN用于在名为“事件”的变量中为表单的每个步骤分配一个数字:
SELECT date
,time_stamp
,user_id
,visit_id
,collect_list(events) as event_seq
FROM my_table
GROUP BY date
,start_time
time_stamp
,user_id
,visit_id

不出所料,这似乎将所有用户12345680的交互都列出在一个字符串中。
|    date  |     timestamp      | user_id  |  visit_id   |  event_seq  |
|----------------------------------------------------------------------|
|2017-10-25| 2017-10-25 04:58:08| 12345678 | 1234 | 1,2,3,4,5 |
|2017-10-25| 2017-10-25 05:00:02| 12345679 | 1235 | 1,2 |
|2017-10-25| 2017-10-25 05:05:06| 12345680 | 1236 |1,1,1,2,3,4,5|
|2017-10-25| 2017-10-25 06:31:12| 12345681 | 1237 | 1,5,2,4,3, |

但是,我希望看到的是序列中第一个事件标记的每一行,类似于下面的结果集,其中每个用户12345680的重新启动都发生在不同的行上。
|    date  |     timestamp      | user_id  |  visit_id   |  event_seq  |
|----------------------------------------------------------------------|
|2017-10-25| 2017-10-25 04:58:08| 12345678 | 1234 | 1,2,3,4,5 |
|2017-10-25| 2017-10-25 05:00:02| 12345679 | 1235 | 1,2 |
|2017-10-25| 2017-10-25 05:05:06| 12345680 | 1236 | 1 |
|2017-10-25| 2017-10-25 05:30:24| 12345680 | 1236 | 1 |
|2017-10-25| 2017-10-25 06:17:24| 12345680 | 1236 | 1,2,3,4,5 |
|2017-10-25| 2017-10-25 06:31:12| 12345681 | 1237 | 1,5,2,4,3, |

有人可以提供有关如何使用collect_list达到所需结果集的任何指导吗?

最佳答案

在您的SQL中,似乎在表中有一列名为start_time的列。假设您有一个,请参见下面的解决方案


CREATE EXTERNAL TABLE my_table(
event_date DATE,
event_start_timestamp TIMESTAMP,
event_timestamp TIMESTAMP,
user_id STRING,
visit_id STRING,
event STRING,
event_seq STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

数据
2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:08| 12345678 |    1234     |   firstname   |       1     
2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:20| 12345678 | 1234 | lastname | 2
2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:35| 12345678 | 1234 | dob | 3
2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:40| 12345678 | 1234 | postcode | 4
2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:40| 12345678 | 1234 | email | 5
2017-10-25| 2017-10-25 05:00:02| 2017-10-25 05:00:02| 12345679 | 1235 | firstname | 1
2017-10-25| 2017-10-25 05:00:02| 2017-10-25 05:00:10| 12345679 | 1235 | lastname | 2
2017-10-25| 2017-10-25 05:05:06| 2017-10-25 05:05:06| 12345680 | 1236 | firstname | 1
2017-10-25| 2017-10-25 05:30:24| 2017-10-25 05:30:24| 12345680 | 1236 | firstname | 1
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:17:24| 12345680 | 1236 | firstname | 1
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:30| 12345680 | 1236 | lastname | 2
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:45| 12345680 | 1236 | dob | 3
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:45| 12345680 | 1236 | postcode | 4
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:45| 12345680 | 1236 | email | 5
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:31:12| 12345681 | 1237 | firstname | 1
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:31:18| 12345681 | 1237 | email | 5
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:31:50| 12345681 | 1237 | lastname | 2
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:32:16| 12345681 | 1237 | postcode | 4
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:32:40| 12345681 | 1237 | dob | 3

SQL查询
SELECT event_date,
user_id,
visit_id,
event_start_timestamp,
collect_list(event_seq)
FROM (SELECT event_date,
event_start_timestamp,
event_timestamp,
user_id,
visit_id,
event_seq
FROM my_table
SORT BY user_id, visit_id, event_start_timestamp, event_timestamp ASC) v
GROUP BY event_date, user_id, visit_id, event_start_timestamp ;

输出
2017-10-25   12345678       1234        2017-10-25 04:58:08 ["       1     ","       2     ","       3     ","       4     ","       5     "]
2017-10-25 12345679 1235 2017-10-25 05:00:02 [" 1 "," 2 "]
2017-10-25 12345680 1236 2017-10-25 05:05:06 [" 1 "]
2017-10-25 12345680 1236 2017-10-25 05:30:24 [" 1 "]
2017-10-25 12345680 1236 2017-10-25 06:17:24 [" 1 "," 2 "," 3 "," 4 "," 5 "]
2017-10-25 12345681 1237 2017-10-25 06:31:12 [" 1 "," 5 "," 2 "," 4 "," 3 "]

让我们知道这是否有效!

并且请不要使用列名作为DATE,TIMESTAMP等,它们是保留字:)

关于sql - 在SQL(Hive)中使用collect_list函数汇总用户序列,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/47398295/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com