gpt4 book ai didi

sql - 具有间隙约束的 BigQuery 平均时间

转载 作者:行者123 更新时间:2023-12-01 00:16:38 26 4
gpt4 key购买 nike

我有一个表示消息的 BigQuery 表,每条消息都属于一个对话并有一个日期:

conversation    date    
1 2018-06-22 23:16:46.456 UTC
2 2018-06-05 00:07:12.178 UTC
1 2018-06-22 23:16:46.456 UTC
4 2018-06-05 00:07:12.178 UTC
3 2018-06-22 23:51:28.540 UTC
3 2018-06-23 00:02:59.285 UTC
4 2018-06-04 23:21:59.500 UTC

我需要获得对话的平均时间

我使用这个查询来获取它:
SELECT conversation, timestamp_diff(MAX(date), MIN(date), MINUTE) minutes
FROM `Message`
GROUP BY conversation

但是由于一些对话需要几天时间,当消息之间的间隔大于 1 小时时,它们必须被切成更小的块,例如:
conversation    date    
2 2018-06-22 00:01:46.456 UTC # group 1
2 2018-06-22 00:07:12.178 UTC # group 1
2 2018-06-22 00:16:46.456 UTC # group 1
2 2018-06-22 01:07:42.178 UTC # group 1
there is a gap here
2 2018-06-22 12:51:28.540 UTC # group 2
2 2018-06-22 13:00:40.486 UTC # group 2
there is another gap here
2 2018-06-22 19:54:30.031 UTC # group 3

我认为这可以使用分析函数:
https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts

但我不知道该怎么做,任何帮助都会非常有用。

最佳答案

下面是 BigQuery 标准 SQL

they have to be cut into smaller chunks when a the gap between messages is larger than 1 hour:


#standardSQL
WITH `project.dataset.table` AS (
SELECT 2 conversation, TIMESTAMP '2018-06-22 00:01:46.456 UTC' dt UNION ALL # group 1
SELECT 2, '2018-06-22 00:07:12.178 UTC' UNION ALL # group 1
SELECT 2, '2018-06-22 00:16:46.456 UTC' UNION ALL # group 1
SELECT 2, '2018-06-22 01:07:42.178 UTC' UNION ALL # group 1
SELECT 2, '2018-06-22 12:51:28.540 UTC' UNION ALL # group 2
SELECT 2, '2018-06-22 13:00:40.486 UTC' UNION ALL # group 2
SELECT 2, '2018-06-22 19:54:30.031 UTC' # group 3
), conversation_groups AS (
SELECT
conversation, dt,
SUM(flag) OVER(PARTITION BY conversation ORDER BY dt) conversation_group
FROM (
SELECT
conversation, dt,
SIGN(IFNULL(TIMESTAMP_DIFF(dt, LAG(dt) OVER(PARTITION BY conversation ORDER BY dt), HOUR), 0)) flag
FROM `project.dataset.table`
)
)
SELECT *
FROM conversation_groups
ORDER BY conversation, dt

结果为
Row conversation    dt                              conversation_group   
1 2 2018-06-22 00:01:46.456 UTC 0
2 2 2018-06-22 00:07:12.178 UTC 0
3 2 2018-06-22 00:16:46.456 UTC 0
4 2 2018-06-22 01:07:42.178 UTC 0
5 2 2018-06-22 12:51:28.540 UTC 1
6 2 2018-06-22 13:00:40.486 UTC 1
7 2 2018-06-22 19:54:30.031 UTC 2

I need to get the average time spent in a conversation


#standardSQL
WITH `project.dataset.table` AS (
SELECT 2 conversation, TIMESTAMP '2018-06-22 00:01:46.456 UTC' dt UNION ALL # group 1
SELECT 2, '2018-06-22 00:07:12.178 UTC' UNION ALL # group 1
SELECT 2, '2018-06-22 00:16:46.456 UTC' UNION ALL # group 1
SELECT 2, '2018-06-22 01:07:42.178 UTC' UNION ALL # group 1
SELECT 2, '2018-06-22 12:51:28.540 UTC' UNION ALL # group 2
SELECT 2, '2018-06-22 13:00:40.486 UTC' UNION ALL # group 2
SELECT 2, '2018-06-22 19:54:30.031 UTC' # group 3
), conversation_groups AS (
SELECT
conversation, dt,
SUM(flag) OVER(PARTITION BY conversation ORDER BY dt) conversation_group
FROM (
SELECT
conversation, dt,
SIGN(IFNULL(TIMESTAMP_DIFF(dt, LAG(dt) OVER(PARTITION BY conversation ORDER BY dt), HOUR), 0)) flag
FROM `project.dataset.table`
)
)
SELECT conversation, AVG(IF(duration = 0, NULL, duration)) avg_duration
FROM (
SELECT
conversation, conversation_group,
TIMESTAMP_DIFF(MAX(dt), MIN(dt), MINUTE) duration
FROM conversation_groups
GROUP BY conversation, conversation_group
)
GROUP BY conversation
ORDER BY conversation

结果为
Row conversation    avg_duration     
1 2 37.0

注意:您可以根据您的特定需求/对如何执行此操作的看法来调整计算 avg 的逻辑 - 但上面的方法是 - 首先计算每个组的持续时间,然后计算这些组的平均持续时间 - 和注意:如果持续时间为零,则将其替换为 NULL,因此不会影响平均值计算。持续时间以 MINUTE 为单位计算,但您可以使用 SECOND 或任何您需要的

关于sql - 具有间隙约束的 BigQuery 平均时间,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/52087555/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com