sql - BigQuery: How to merge HLL Sketches over a window function? (Counting distinct values over a rolling window)

Reposted. Author: 行者123. Updated: 2023-12-03 03:36:41

Example of the relevant table schema:

+---------------------------+-------------------+
| activity_date - TIMESTAMP | user_id - STRING |
+---------------------------+-------------------+
| 2017-02-22 17:36:08 UTC | fake_id_i24385787 |
+---------------------------+-------------------+
| 2017-02-22 04:27:08 UTC | fake_id_234885747 |
+---------------------------+-------------------+
| 2017-02-22 08:36:08 UTC | fake_id_i24385787 |
+---------------------------+-------------------+

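I need to count distinct active users over a rolling time period (90 days) on a large dataset, and I am running into problems because of the dataset's size.

First, I tried using a window function, similar to the answer here: https://stackoverflow.com/a/27574474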

WITH
  daily AS (
    SELECT
      DATE(activity_date) day,
      user_id
    FROM
      `fake-table`)
SELECT
  day,
  SUM(APPROX_COUNT_DISTINCT(user_id)) OVER (ORDER BY day ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) ninty_day_window_apprx
FROM
  daily
GROUP BY
  1
ORDER BY
  1 DESC

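However, this computes the distinct user count for each day and then sums those counts, so a user who appears on more than one day is counted once per day inside the window. It is therefore not an accurate measure of distinct users over 90 days.

The next thing I tried was the solution at https://stackoverflow.com/a/47659590: concatenate all the distinct user_ids for each window into an array, then count the distinct values in it.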
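The overcount can be illustrated outside BigQuery with a small Python sketch. The event list, the function names, and the 3-day window are all made up for illustration, and the window is taken by calendar date, which matches a ROWS-based frame only when every day has exactly one row. Exact sets stand in for the approximate counts.

```python
from datetime import date, timedelta

# Hypothetical activity log of (day, user_id) pairs, invented for this example.
events = [
    (date(2017, 2, 20), "a"),
    (date(2017, 2, 20), "b"),
    (date(2017, 2, 21), "a"),
    (date(2017, 2, 22), "a"),
    (date(2017, 2, 22), "c"),
]


def daily_sets(rows):
    """Group user ids into one set per calendar day."""
    by_day = {}
    for day, user in rows:
        by_day.setdefault(day, set()).add(user)
    return by_day


def rolling_sum_of_daily_counts(rows, window_days):
    """Add up per-day distinct counts, as the SUM(...) OVER query does."""
    by_day = daily_sets(rows)
    result = {}
    for day in sorted(by_day):
        start = day - timedelta(days=window_days - 1)
        result[day] = sum(len(s) for d, s in by_day.items() if start <= d <= day)
    return result


def rolling_distinct(rows, window_days):
    """Union the per-day sets first, then count: the intended metric."""
    by_day = daily_sets(rows)
    result = {}
    for day in sorted(by_day):
        start = day - timedelta(days=window_days - 1)
        users = set()
        for d, s in by_day.items():
            if start <= d <= day:
                users |= s
        result[day] = len(users)
    return result


if __name__ == "__main__":
    last = date(2017, 2, 22)
    print(rolling_sum_of_daily_counts(events, 3)[last])  # user "a" counted three times
    print(rolling_distinct(events, 3)[last])
```

User "a" is active on all three days, so the summed version counts that user three times while the union counts them once.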

WITH daily AS (
  SELECT DATE(activity_date) day, STRING_AGG(DISTINCT user_id) users
  FROM `fake-table`
  GROUP BY day
), temp2 AS (
  SELECT
    day,
    STRING_AGG(users) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) users
  FROM daily
)
SELECT day,
  (SELECT APPROX_COUNT_DISTINCT(id) FROM UNNEST(SPLIT(users)) AS id) Unique90Days
FROM temp2
ORDER BY 1 DESC

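However, on anything large this quickly runs out of memory.

The next step was to use HLL sketches to represent the distinct IDs as much smaller values, so that memory would no longer be an issue. I thought my problem was solved, but running the query below fails with an error that says only "Function MERGE_PARTIAL is not supported". I also tried MERGE and got the same error. It only happens when using a window function; creating a sketch for each day's values works fine.

I read through the BigQuery Standard SQL documentation and saw nothing about using HLL_COUNT.MERGE_PARTIAL or HLL_COUNT.MERGE with window functions. Presumably this should take the 90 sketches and combine them into a single HLL sketch that represents the distinct values across the 90 original sketches?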

WITH
  daily AS (
    SELECT
      DATE(activity_date) day,
      HLL_COUNT.INIT(user_id) sketch
    FROM
      `fake-table`
    GROUP BY
      1
    ORDER BY
      1 DESC),

  rolling AS (
    SELECT
      day,
      HLL_COUNT.MERGE_PARTIAL(sketch) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch
    FROM daily)

SELECT
  day,
  HLL_COUNT.EXTRACT(rolling_sketch)
FROM
  rolling
ORDER BY
  1

[Image: the error message "Function MERGE_PARTIAL is not supported"]

Any ideas why this error occurs, or how to work around it?

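Best answer

Below is a BigQuery Standard SQL query that does exactly what you want using a window function. It collects each window's daily sketches into an array with ARRAY_AGG, which is allowed as a window function, and then merges that array with HLL_COUNT.MERGE in a scalar subquery.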

#standardSQL
SELECT day,
  (SELECT HLL_COUNT.MERGE(sketch) FROM UNNEST(rolling_sketch_arr) sketch) rolling_sketch
FROM (
  SELECT day,
    ARRAY_AGG(ids_sketch) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch_arr
  FROM (
    SELECT day, HLL_COUNT.INIT(id) ids_sketch
    FROM `project.dataset.table`
    GROUP BY day
  )
)

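You can test and play with the above using [completely] dummy data, as in the example below. Note that the window here is shortened to 2 PRECEDING (three days) so the effect is visible on five days of data.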

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, DATE '2019-01-01' day UNION ALL
  SELECT 2, '2019-01-01' UNION ALL
  SELECT 3, '2019-01-01' UNION ALL
  SELECT 1, '2019-01-02' UNION ALL
  SELECT 4, '2019-01-02' UNION ALL
  SELECT 2, '2019-01-03' UNION ALL
  SELECT 3, '2019-01-03' UNION ALL
  SELECT 4, '2019-01-03' UNION ALL
  SELECT 5, '2019-01-03' UNION ALL
  SELECT 1, '2019-01-04' UNION ALL
  SELECT 4, '2019-01-04' UNION ALL
  SELECT 2, '2019-01-05' UNION ALL
  SELECT 3, '2019-01-05' UNION ALL
  SELECT 5, '2019-01-05' UNION ALL
  SELECT 6, '2019-01-05'
)
SELECT day,
  (SELECT HLL_COUNT.MERGE(sketch) FROM UNNEST(rolling_sketch_arr) sketch) rolling_sketch
FROM (
  SELECT day,
    ARRAY_AGG(ids_sketch) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) rolling_sketch_arr
  FROM (
    SELECT day, HLL_COUNT.INIT(id) ids_sketch
    FROM `project.dataset.table`
    GROUP BY day
  )
)
-- ORDER BY day

Result

Row  day         rolling_sketch
1    2019-01-01  3
2    2019-01-02  4
3    2019-01-03  5
4    2019-01-04  5
5    2019-01-05  6
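As a cross-check on the table above, the same rolling counts can be reproduced with exact set unions in Python. This is only a sketch: plain sets stand in for HLL sketches, the function name is hypothetical, and it says nothing about how BigQuery computes the merge internally. On data this small the approximate and exact counts coincide.

```python
from datetime import date, timedelta

# The same (id, day) rows as the dummy table in the answer's test query.
rows = [
    (1, date(2019, 1, 1)), (2, date(2019, 1, 1)), (3, date(2019, 1, 1)),
    (1, date(2019, 1, 2)), (4, date(2019, 1, 2)),
    (2, date(2019, 1, 3)), (3, date(2019, 1, 3)),
    (4, date(2019, 1, 3)), (5, date(2019, 1, 3)),
    (1, date(2019, 1, 4)), (4, date(2019, 1, 4)),
    (2, date(2019, 1, 5)), (3, date(2019, 1, 5)),
    (5, date(2019, 1, 5)), (6, date(2019, 1, 5)),
]


def rolling_distinct_counts(rows, preceding_days):
    """Exact analogue of RANGE BETWEEN n PRECEDING AND CURRENT ROW on days."""
    # Step 1: one set of ids per day (the role HLL_COUNT.INIT plays per day).
    per_day = {}
    for user_id, day in rows:
        per_day.setdefault(day, set()).add(user_id)

    # Step 2: for each day, union the sets that fall inside its window
    # (the role of ARRAY_AGG over the window followed by HLL_COUNT.MERGE).
    out = []
    for day in sorted(per_day):
        first = day - timedelta(days=preceding_days)
        merged = set()
        for d, ids in per_day.items():
            if first <= d <= day:
                merged |= ids
        out.append((day.isoformat(), len(merged)))
    return out


if __name__ == "__main__":
    for day, count in rolling_distinct_counts(rows, 2):
        print(day, count)
```

The two steps mirror the shape of the query: build one small summary per day, then combine the summaries inside each window rather than the raw ids.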

This question, sql - BigQuery: How to merge HLL Sketches over a window function? (Counting distinct values over a rolling window), comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/54815851/
