
sql - Redshift GROUP BY time interval

Reposted. Author: 行者123. Updated: 2023-12-02 00:48:21

Currently, I have the following raw data in Redshift.

timestamp                   ,lead
==================================
"2008-04-09 10:02:01.000000",true
"2008-04-09 10:03:05.000000",true
"2008-04-09 10:31:07.000000",true
"2008-04-09 11:00:05.000000",false
...

So, I want to produce aggregated data at 30-minute intervals. The result I'm hoping for is:

timestamp                   ,count
==================================
"2008-04-09 10:00:00.000000",2
"2008-04-09 10:30:00.000000",1
"2008-04-09 11:00:00.000000",0
...
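For reference, the 30-minute bucketing described above can be sketched outside the database. A minimal, illustrative Python sketch (not the Redshift solution) that floors each timestamp to its half-hour boundary and counts the `lead = true` rows per bucket:

```python
from datetime import datetime

# Sample rows from the question: (timestamp, lead)
events = [
    (datetime(2008, 4, 9, 10, 2, 1), True),
    (datetime(2008, 4, 9, 10, 3, 5), True),
    (datetime(2008, 4, 9, 10, 31, 7), True),
    (datetime(2008, 4, 9, 11, 0, 5), False),
]

def half_hour_floor(ts):
    """Truncate a timestamp down to its :00 or :30 boundary."""
    return ts.replace(minute=(ts.minute // 30) * 30, second=0, microsecond=0)

# Count lead=true events per 30-minute bucket.
counts = {}
for ts, lead in events:
    bucket = half_hour_floor(ts)
    counts[bucket] = counts.get(bucket, 0) + (1 if lead else 0)

for bucket in sorted(counts):
    print(bucket, counts[bucket])
# 2008-04-09 10:00:00 2
# 2008-04-09 10:30:00 1
# 2008-04-09 11:00:00 0
```

Note this only produces buckets that contain at least one event; the empty-bucket rows in the desired output come from joining against a generated series of intervals, which is where the Redshift difficulty arises.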

I referred to https://stackoverflow.com/a/12046382/3238864, which works for PostgreSQL.

I tried to mimic the posted code by using:

with thirty_min_intervals as (
    select
        (select min(timestamp)::date from events) + (n || ' minutes')::interval start_time,
        (select min(timestamp)::date from events) + ((n+30) || ' minutes')::interval end_time
    from generate_series(0, (24*60), 30) n
)
select count(CASE WHEN lead THEN 1 END) from events e
right join thirty_min_intervals f
    on e.timestamp >= f.start_time and e.timestamp < f.end_time
group by f.start_time, f.end_time
order by f.start_time;

However, I get the error:

[0A000] ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.

What is a good way to compute aggregated data over N-minute intervals in Redshift?

Best Answer

Joe's answer is a great, neat solution. I feel that when you work with Redshift you should always consider how your data is distributed and sorted; it can have a huge impact on performance.

Building on Joe's excellent answer: I'll materialise the example events. In reality, the events will be in a table.

drop table if exists public.temporary_events;
create table public.temporary_events AS
select ts::timestamp as ts
,lead
from
( SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
;

Now run the explain:

explain 
WITH time_dimension
AS (SELECT dtm
,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
/* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
(SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
FROM /* Generate a number sequence of 100,000 values from a large internal table */
(SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)

SELECT dtm_half_hour
,COUNT(CASE WHEN lead THEN 1 END)
FROM time_dimension td
LEFT JOIN public.temporary_events e
ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
-- ORDER BY 1 Just to simplify the job a little

The output is:

XN HashAggregate  (cost=999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
-> XN Hash Left Join DS_DIST_BOTH (cost=0.05..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
Outer Dist Key: ('2018-11-27 17:00:35'::timestamp without time zone - ((rn.n)::double precision * '00:00:01'::interval))
Inner Dist Key: e."ts"
Hash Cond: ("outer"."?column2?" = "inner"."ts")
-> XN Subquery Scan rn (cost=0.00..14.95 rows=1 width=8)
Filter: (((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) <= '2017-02-16 11:00:00'::timestamp without time zone) AND ((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) >= '2017-02-16 09:30:00'::timestamp without time zone))
-> XN Limit (cost=0.00..1.95 rows=130 width=0)
-> XN Window (cost=0.00..1.95 rows=130 width=0)
-> XN Network (cost=0.00..1.30 rows=130 width=0)
Send to slice 0
-> XN Seq Scan on stl_scan (cost=0.00..1.30 rows=130 width=0)
-> XN Hash (cost=0.04..0.04 rows=4 width=9)
-> XN Seq Scan on temporary_events e (cost=0.00..0.04 rows=4 width=9)

Kablamo!

As Joe says, you may well happily use this pattern without issue. However, once your data gets big enough, or your SQL logic gets complex enough, you may want to optimise. If for no other reason, you'll probably want to understand the explain plan as you add more SQL logic to your code.

There are three areas we can look at:

  1. The join. Make the join between the two sets of data work on the same data type. Here, we are joining a timestamp to a time interval.
  2. Data distribution. Materialise and distribute both tables by timestamp.
  3. Data sorting. If the events are sorted by this timestamp, and the time dimension is sorted by both timestamps, then you can complete the entire query using a merge join, without moving any data and without sending the data to the leader node for aggregation.
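The `dtm_half_hour` expression in the time dimension relies on modular arithmetic: it subtracts from each timestamp the number of seconds elapsed since the last :00 or :30 boundary. An illustrative Python sketch of the same calculation (this is just to explain the SQL, not Redshift code):

```python
from datetime import datetime, timedelta

def dtm_half_hour(dtm):
    """Mirror the SQL expression:
    dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second')
    Note that % binds tighter than +, so this is seconds plus
    (minutes-in-seconds modulo 1800), i.e. the seconds elapsed
    since the last half-hour boundary.
    """
    seconds_past_boundary = dtm.second + (dtm.minute * 60) % 1800
    return dtm - timedelta(seconds=seconds_past_boundary)

# e.g. 10:31:07 falls in the 10:30:00 bucket
print(dtm_half_hour(datetime(2017, 2, 16, 10, 31, 7)))
# 2017-02-16 10:30:00
```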

Observe:

drop table if exists public.temporary_time_dimension;
create table public.temporary_time_dimension
distkey(dtm) sortkey(dtm, dtm_half_hour)
AS (SELECT dtm::timestamp as dtm
,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
/* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
(SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
FROM /* Generate a number sequence of 100,000 values from a large internal table */
(SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
;

drop table if exists public.temporary_events;
create table public.temporary_events
distkey(ts) sortkey(ts)
AS
select ts::timestamp as ts
,lead
from
( SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
;

explain
SELECT
dtm_half_hour
,COUNT(CASE WHEN lead THEN 1 END)
FROM public.temporary_time_dimension td
LEFT JOIN public.temporary_events e
ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
--order by dtm_half_hour

Which then gives:

XN HashAggregate  (cost=1512.67..1512.68 rows=1 width=9)
-> XN Merge Left Join DS_DIST_NONE (cost=0.00..1504.26 rows=1682 width=9)
Merge Cond: ("outer".dtm = "inner"."ts")
-> XN Seq Scan on temporary_time_dimension td (cost=0.00..1500.00 rows=1682 width=16)
Filter: ((dtm_half_hour <= '2017-02-16 11:00:00'::timestamp without time zone) AND (dtm_half_hour >= '2017-02-16 09:30:00'::timestamp without time zone))
-> XN Seq Scan on temporary_events e (cost=0.00..0.04 rows=4 width=9)

Important notes:

  • I've taken the ORDER BY out. Putting it back in will result in the data being sent to the leader node for sorting. If you can do without the sort, then do without it!
  • I'm certain that choosing the timestamp as the events table sort key will not be ideal in many cases. I just wanted to show what is possible.
  • I think you'd probably want your time dimension created with diststyle all, and sorted. That would ensure your join generates no network traffic.

Regarding sql - Redshift GROUP BY time interval, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42063003/
