gpt4 book ai didi

sql - PostgreSQL 窗口函数 : partition by comparison

转载 作者:行者123 更新时间:2023-11-29 11:31:08 26 4
gpt4 key购买 nike

我试图在 PostgreSQL 查询的 WINDOW 函数中找到与 PARTITION BY 子句中的当前行进行比较的方法。

假设我在以下查询中有这 5 个元素的短列表(在实际情况下,我有数千甚至数百万行)。我试图为每一行获取下一个不同元素(事件列)的 ID,以及前一个不同元素的 ID。

WITH events AS(
SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT lag(id) over w as previous_different, event
, lead(id) over w as next_different
FROM events ev
WINDOW w AS (PARTITION BY event!=ev.event ORDER BY date ASC);

我知道比较 event!=ev.event 是不正确的,但这就是我想要达到的目的。

我得到的结果是(就像我删除 PARTITION BY 子句一样):

 |12|2
1|12|3
2|13|4
3|13|5
4|12|

我想得到的结果是:

 |12|3
|12|3
2|13|5
2|13|5
4|12|

任何人都知道这是否可能以及如何实现?非常感谢!

编辑: 我知道我可以用两个 JOIN,一个 ORDER BY 和一个 DISTINCT ON,但在数百万行的实际情况下,它的效率非常低:

WITH events AS(
SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT DISTINCT ON (e.id, e.date) e1.id, e.event, e2.id
FROM events e
LEFT JOIN events e1 ON (e1.date<=e.date AND e1.id!=e.id AND e1.event!=e.event)
LEFT JOIN events e2 ON (e2.date>=e.date AND e2.id!=e.id AND e2.event!=e.event)
ORDER BY e.date ASC, e.id ASC, e1.date DESC, e1.id DESC, e2.date ASC, e2.id ASC

最佳答案

使用几个不同的 window functions和两个子查询,这应该工作得相当快:

WITH events(id, event, ts) AS (
VALUES
(1, 12, '2014-03-19 08:00:00'::timestamp)
,(2, 12, '2014-03-19 08:30:00')
,(3, 13, '2014-03-19 09:00:00')
,(4, 13, '2014-03-19 09:30:00')
,(5, 12, '2014-03-19 10:00:00')
)
SELECT first_value(pre_id) OVER (PARTITION BY grp ORDER BY ts) AS pre_id
, id, ts
, first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
FROM (
SELECT *, count(step) OVER w AS grp
FROM (
SELECT id, ts
, NULLIF(lag(event) OVER w, event) AS step
, lag(id) OVER w AS pre_id
, lead(id) OVER w AS post_id
FROM events
WINDOW w AS (ORDER BY ts)
) sub1
WINDOW w AS (ORDER BY ts)
) sub2
ORDER BY ts;

使用 ts 作为时间戳列的名称。
假设 ts 是唯一的 - 并且索引(unique constraint 会自动执行)。

在一个包含 50k 行的真实表的测试中,它只需要一次索引扫描。所以,即使是大 table ,也应该相当快。相比之下,您使用 join/distinct 的查询在一分钟后没有完成(如预期的那样)。
即使是一次处理一个交叉连接的优化版本(几乎没有限制条件的左连接实际上是一个有限的交叉连接)也不会在一分钟后完成。

为了获得大表的最佳性能,请调整内存设置,特别是 work_mem (用于大排序操作)。如果您可以节省 RAM,请考虑暂时为您的 session 将其设置(高得多)。阅读更多 herehere .

如何?

  1. 在子查询 sub1 中,查看前一行的事件,只有在它发生变化时才保留它,从而标记新组的第一个元素。同时获取上一行和下一行的id(pre_id,post_id)。

  2. 在子查询sub2中,count()只计算非空值。生成的 grp 标记连续相同事件 block 中的对等点。

  3. 在最后的 SELECT 中,为每一行取每组的第一个 pre_id 和最后一个 post_id 以达到所需的结果。
    实际上,这在外部 SELECT 中应该更快:

     last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
    RANGE BETWEEN UNBOUNDED PRECEDING
    AND UNBOUNDED FOLLOWING) AS post_id

    ...因为窗口的排序顺序与pre_id 的窗口一致,所以只需要一次排序。快速测试似乎可以证实这一点。 More about this frame definition.

SQL Fiddle.

关于sql - PostgreSQL 窗口函数 : partition by comparison,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/22512335/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com