
sql - Speeding up a PostgreSQL query using timestamp OVERLAPS and "PARTITION BY"

Reposted. Author: 行者123. Updated: 2023-11-29 11:34:41

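I have a fairly large table (500K - 1M rows) in PostgreSQL 9.0 that holds generic "time slice" information; that is, it determines when a row in another table (a "feature") is valid. The definition looks like this (slightly simplified):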

CREATE TABLE feature_timeslice
(
    timeslice_id int NOT NULL,
    feature_id int NOT NULL,
    valid_time_begin timestamp NOT NULL,
    valid_time_end timestamp,
    sequence_number smallint,
    -- Some other columns
    CONSTRAINT pk_feature_timeslice PRIMARY KEY (timeslice_id)
    -- Some other constraints
);

CREATE INDEX ix_feature_timeslice_feature_id
    ON feature_timeslice USING btree (feature_id);

Many other tables for specific features are then joined to it on timeslice_id:

CREATE TABLE specific_feature_timeslice
(
    timeslice_id int NOT NULL,
    -- Other columns
    CONSTRAINT pk_specific_feature_timeslice PRIMARY KEY (timeslice_id),
    CONSTRAINT fk_specific_feature_timeslice_feature_timeslice FOREIGN KEY (timeslice_id) REFERENCES feature_timeslice (timeslice_id)
)

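There may be several time slices with overlapping validity periods (begin/end times), but the one with the highest sequence_number takes precedence (again, slightly simplified, but close enough). I want to efficiently find the currently valid row for each feature_id, so I defined a view like this: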

CREATE VIEW feature_timeslice_id_now
AS
    SELECT timeslice_id
    FROM
    (
        SELECT timeslice_id, rank() OVER
        (
            PARTITION BY feature_id
            ORDER BY sequence_number DESC, timeslice_id DESC
        )
        FROM feature_timeslice
        WHERE (current_timestamp AT TIME ZONE 'UTC', '0'::interval) OVERLAPS (valid_time_begin, COALESCE(valid_time_end, 'infinity'::timestamp))
    ) subq
    WHERE subq.rank = 1
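
For readers less familiar with OVERLAPS, the filter in the view can be read as a pair of ordinary comparisons. The sketch below is an illustrative rewrite, not part of the original post. It relies on PostgreSQL's documented rule that a period covers the half-open interval start <= time < end, and it assumes that valid_time_begin is earlier than valid_time_end on every row that has an end time (OVERLAPS would otherwise reorder or collapse the endpoints, and the two forms could then differ).

```sql
-- Assumed-equivalent form of the view's WHERE clause:
-- "now" lies inside [valid_time_begin, valid_time_end), with a NULL end meaning open-ended.
SELECT timeslice_id
FROM feature_timeslice
WHERE valid_time_begin <= (current_timestamp AT TIME ZONE 'UTC')
  AND (valid_time_end IS NULL
       OR (current_timestamp AT TIME ZONE 'UTC') < valid_time_end);
```

Plain comparisons like these are expressions the planner can match against an ordinary btree index, whereas OVERLAPS shows up in a plan only as a filter. Whether that helps on this data set has not been measured here.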

It is typically queried like this:

SELECT *
FROM specific_feature_timeslice sf
JOIN feature_timeslice_id_now n USING (timeslice_id)
WHERE sf.name = 'SOMETHING'

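This works, but it is still a bit too slow: it takes 1-2 seconds even though perhaps only 1-5 rows are returned, because the specific_feature_timeslice criteria usually narrow the result down considerably. (More complex queries that join several feature views become very slow very quickly.) I can't figure out how to get PostgreSQL to do this more efficiently. The query plan looks like this: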

  Join Filter: ((r.timeslice_id)::integer = (subq.timeslice_id)::integer)
  ->  Subquery Scan on subq (cost=32034.36..37876.98 rows=835 width=4) (actual time=2086.125..5243.467 rows=250918 loops=1)
        Filter: (subq.rank = 1)
        ->  WindowAgg (cost=32034.36..35790.33 rows=166932 width=10) (actual time=2086.110..4066.351 rows=250918 loops=1)
              ->  Sort (cost=32034.36..32451.69 rows=166932 width=10) (actual time=2086.065..2654.971 rows=250918 loops=1)
                    Sort Key: feature_timeslice.feature_id, feature_timeslice.sequence_number, feature_timeslice.timeslice_id
                    Sort Method: quicksort Memory: 13898kB
                    ->  Seq Scan on feature_timeslice (cost=0.00..17553.93 rows=166932 width=10) (actual time=287.270..1225.595 rows=250918 loops=1)
                          Filter: overlaps(timezone('UTC'::text, now()), (timezone('UTC'::text, now()) + '00:00:00'::interval), (valid_time_begin)::timestamp without time zone, COALESCE((valid_time_end)::timestamp without time zone, 'infinity'::timestamp without time zone))
  ->  Materialize (cost=0.00..1093.85 rows=2 width=139) (actual time=0.002..0.007 rows=2 loops=250918)
        ->  Seq Scan on specific_feature_timeslice sf (cost=0.00..1093.84 rows=2 width=139) (actual time=1.958..7.674 rows=2 loops=1)
              Filter: ((name)::text = 'SOMETHING'::text)
Total runtime: 10319.875 ms

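In fact, I want to run this query for any given time, not just the current time. I have defined a function for that which takes the time as a parameter, but querying for "now" is the most common case, so even if I could only speed that up it would be a big improvement.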
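
The post does not show that function, so the following is only a guess at its shape: the same logic as the view with the timestamp passed in as a parameter. The name feature_timeslice_id_asof and the column alias rnk are hypothetical.

```sql
-- Hypothetical parameterized counterpart of feature_timeslice_id_now.
-- $1 is the point in time to evaluate (SQL functions before 9.2 cannot use parameter names in the body).
CREATE FUNCTION feature_timeslice_id_asof(effective_time timestamp)
RETURNS SETOF int
AS $BODY$
    SELECT timeslice_id
    FROM
    (
        SELECT timeslice_id,
               rank() OVER
               (
                   PARTITION BY feature_id
                   ORDER BY sequence_number DESC, timeslice_id DESC
               ) AS rnk
        FROM feature_timeslice
        WHERE ($1, '0'::interval) OVERLAPS (valid_time_begin, COALESCE(valid_time_end, 'infinity'::timestamp))
    ) subq
    WHERE subq.rnk = 1
$BODY$ LANGUAGE SQL STABLE;

-- Usage: the column alias is needed so that USING (timeslice_id) can find the column.
SELECT *
FROM specific_feature_timeslice sf
JOIN feature_timeslice_id_asof('2011-09-26') AS n(timeslice_id) USING (timeslice_id)
WHERE sf.name = 'SOMETHING';
```

A function written this way would still rank every valid time slice before the join, so it should be expected to share the view's performance problem.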

== EDIT ==

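OK, I have tried normalizing the table as both answers suggested; that is, I moved valid_time_begin and valid_time_end into a separate table, time_period. I also replaced the window function with WHERE NOT EXISTS ([better candidate time slice]). I upgraded to PostgreSQL 9.1 in the process as well. Taken together, some queries are now about twice as fast. The query plan looks the same as in wildplasser's answer. This is good, but not as good as I had hoped: selecting from a single feature table still takes more than a second.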

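Ideally, I would like to take advantage of the selectivity of the feature WHERE condition, as Erwin Brandstetter says. If I hand-craft a query to do this, it runs in 15-30 ms. Now that's more like it! The hand-crafted query looks something like this: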

WITH filtered_feature AS
(
    SELECT *
    FROM specific_feature_timeslice sf
    JOIN feature_timeslice ft USING (timeslice_id)
    WHERE sf.name = 'SOMETHING'
)
SELECT *
FROM filtered_feature ff
JOIN
(
    SELECT timeslice_id
    FROM filtered_feature candidate
    JOIN time_period candidate_time ON candidate.valid_time_period_id = candidate_time.id
    WHERE ('2011-09-26', '0'::interval) OVERLAPS (candidate_time.valid_time_begin, COALESCE(candidate_time.valid_time_end, 'infinity'::timestamp))
    AND NOT EXISTS
    (
        SELECT *
        FROM filtered_feature better
        JOIN time_period better_time ON better.valid_time_period_id = better_time.id
        WHERE ('2011-09-26', '0'::interval) OVERLAPS (better_time.valid_time_begin, COALESCE(better_time.valid_time_end, 'infinity'::timestamp))
        AND better.feature_id = candidate.feature_id AND better.timeslice_id != candidate.timeslice_id
        AND better.sequence_number > candidate.sequence_number
    )
) AS ft ON ff.timeslice_id = ft.timeslice_id

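Unfortunately, this is far too large and complex to use in ordinary queries that may join many other tables. I need some way to encapsulate this logic in a function (for an arbitrary time) or at least a view (for the current time), but I can't figure out how to do that while still getting the query planner to filter on the specific feature first. If only I could pass a rowset to a function, but as far as I know PostgreSQL does not allow that. Any ideas?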

== CONCLUSION ==

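I ended up using PostgreSQL inheritance to solve this (see my answer), but I would not have come up with the idea without Erwin Brandstetter's answer, so the bounty goes to him. wildplasser's answer was also very helpful, because it let me eliminate the unnecessary window function, which sped things up further. Many thanks to both of you!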

Best answer

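I ended up using PostgreSQL inheritance to solve this, so each specific_feature_timeslice table now inherits from feature_timeslice (rather than referencing it as before). This allows "the selectivity of the feature to take effect first": the query plan begins by narrowing the data down to the few rows I want. The schema now looks like this: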

CREATE TABLE feature_timeslice
(
    timeslice_id int NOT NULL,
    feature_id int NOT NULL,
    valid_time_begin timestamp NOT NULL,
    valid_time_end timestamp,
    sequence_number smallint,
    -- Some other columns
    CONSTRAINT pk_feature_timeslice PRIMARY KEY (timeslice_id)
    -- Some other constraints
);

CREATE TABLE specific_feature_timeslice
(
    -- Feature-specific columns only, e.g.
    name character varying(100),

    CONSTRAINT pk_specific_feature_timeslice PRIMARY KEY (timeslice_id)
)
INHERITS (feature_timeslice);

CREATE INDEX ix_specific_feature_timeslice_feature_id
    ON specific_feature_timeslice (feature_id);
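
Two properties of PostgreSQL inheritance are worth keeping in mind with a schema like this. A query against the parent also reads every child table unless ONLY is given. Primary keys and indexes are not inherited, which is why the child table above declares its own. The queries below are a small illustration of the first point, assuming the two tables defined above exist.

```sql
-- Reads feature_timeslice plus every table that inherits from it.
SELECT count(*) FROM feature_timeslice;

-- Reads only the rows stored directly in the parent table.
SELECT count(*) FROM ONLY feature_timeslice;

-- tableoid shows which physical table each row comes from.
SELECT tableoid::regclass AS source_table, count(*) AS row_count
FROM feature_timeslice
GROUP BY tableoid;
```

Because each table has its own primary key constraint, uniqueness of timeslice_id is enforced per table rather than across the whole hierarchy.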

Each such derived table has its own function that selects the rows in effect at a specified time:

CREATE FUNCTION specific_feature_asof(effective_time timestamp)
RETURNS SETOF specific_feature_timeslice
AS $BODY$
    SELECT candidate.*
    FROM specific_feature_timeslice candidate
    WHERE ($1, '0'::interval) OVERLAPS (candidate.valid_time_begin, COALESCE(candidate.valid_time_end, 'infinity'::timestamp))
    AND NOT EXISTS
    (
        SELECT *
        FROM specific_feature_timeslice better
        WHERE ($1, '0'::interval) OVERLAPS (better.valid_time_begin, COALESCE(better.valid_time_end, 'infinity'::timestamp))
        AND better.feature_id = candidate.feature_id AND better.timeslice_id != candidate.timeslice_id AND better.sequence_number > candidate.sequence_number
    )
$BODY$ LANGUAGE SQL STABLE;
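
One behavioral difference from the earlier rank() view may matter in some data sets: the view broke ties on sequence_number by timeslice_id, whereas this NOT EXISTS form returns every valid time slice that shares the highest sequence_number for a feature. If ties can occur and a single row per feature is wanted, a variant along the following lines might be used. It is a sketch, not the author's function; it uses a fixed example timestamp and assumes sequence_number is never NULL.

```sql
SELECT candidate.*
FROM specific_feature_timeslice candidate
WHERE ('2011-09-30'::timestamp, '0'::interval) OVERLAPS (candidate.valid_time_begin, COALESCE(candidate.valid_time_end, 'infinity'::timestamp))
AND NOT EXISTS
(
    SELECT *
    FROM specific_feature_timeslice better
    WHERE ('2011-09-30'::timestamp, '0'::interval) OVERLAPS (better.valid_time_begin, COALESCE(better.valid_time_end, 'infinity'::timestamp))
    AND better.feature_id = candidate.feature_id
    -- Row comparison: a higher sequence_number wins; on a tie, the higher timeslice_id wins.
    AND (better.sequence_number, better.timeslice_id) > (candidate.sequence_number, candidate.timeslice_id)
);
```

The row comparison also makes the separate timeslice_id != check unnecessary, since a row can never compare greater than itself.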

Of course, I generate these functions automatically; apart from the table name they are identical. A typical query then becomes:
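
The post does not say how the functions are generated. One possible approach, sketched here under several assumptions, is a DO block that loops over the direct children of feature_timeslice in pg_inherits and builds each CREATE FUNCTION statement with format(). It assumes PostgreSQL 9.1 or later (for format()), child tables with simple names on the search_path, and a hypothetical naming rule that turns x_timeslice into x_asof.

```sql
DO $gen$
DECLARE
    child     regclass;
    func_name text;
BEGIN
    -- pg_inherits lists direct children only; deeper hierarchies would need recursion.
    FOR child IN
        SELECT inhrelid::regclass
        FROM pg_inherits
        WHERE inhparent = 'feature_timeslice'::regclass
    LOOP
        -- Assumed naming rule: specific_feature_timeslice -> specific_feature_asof
        func_name := regexp_replace(child::text, '_timeslice$', '') || '_asof';

        -- %1$I is the function name (quoted as an identifier), %2$s the child table.
        EXECUTE format($f$
            CREATE OR REPLACE FUNCTION %1$I(effective_time timestamp)
            RETURNS SETOF %2$s
            AS $BODY$
                SELECT candidate.*
                FROM %2$s candidate
                WHERE ($1, '0'::interval) OVERLAPS (candidate.valid_time_begin, COALESCE(candidate.valid_time_end, 'infinity'::timestamp))
                AND NOT EXISTS
                (
                    SELECT *
                    FROM %2$s better
                    WHERE ($1, '0'::interval) OVERLAPS (better.valid_time_begin, COALESCE(better.valid_time_end, 'infinity'::timestamp))
                    AND better.feature_id = candidate.feature_id AND better.timeslice_id != candidate.timeslice_id AND better.sequence_number > candidate.sequence_number
                )
            $BODY$ LANGUAGE SQL STABLE
        $f$, func_name, child);
    END LOOP;
END
$gen$;
```

Rerunning the block after adding a new child table would create the missing function and leave the existing ones unchanged in behavior, since CREATE OR REPLACE is used.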

SELECT *
FROM specific_feature_asof('2011-09-30')
WHERE name = 'SOMETHING'

The query plan looks like this:

Nested Loop Anti Join  (cost=0.00..412.84 rows=3 width=177) (actual time=0.044..7.038 rows=10 loops=1)
  Join Filter: (((better.timeslice_id)::integer <> (candidate.timeslice_id)::integer) AND ((better.sequence_number)::smallint > (candidate.sequence_number)::smallint))
  ->  Seq Scan on specific_feature_timeslice candidate (cost=0.00..379.66 rows=3 width=177) (actual time=0.018..6.688 rows=10 loops=1)
        Filter: (((name)::text = 'SOMETHING'::text) AND overlaps(('2011-09-30 00:00:00'::timestamp without time zone)::timestamp without time zone, (('2011-09-30 00:00:00'::timestamp without time zone)::timestamp without time zone + '00:00:00'::interval), (valid_time_begin)::timestamp without time zone, COALESCE((valid_time_end)::timestamp without time zone, 'infinity'::timestamp without time zone)))
  ->  Index Scan using ix_specific_feature_timeslice_feature_id on specific_feature_timeslice better (cost=0.00..8.28 rows=1 width=14) (actual time=0.008..0.011 rows=1 loops=10)
        Index Cond: ((feature_id)::integer = (candidate.feature_id)::integer)
        Filter: overlaps(('2011-09-30 00:00:00'::timestamp without time zone)::timestamp without time zone, (('2011-09-30 00:00:00'::timestamp without time zone)::timestamp without time zone + '00:00:00'::interval), (valid_time_begin)::timestamp without time zone, COALESCE((valid_time_end)::timestamp without time zone, 'infinity'::timestamp without time zone))
Total runtime: 7.150 ms

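The performance difference is very significant: a simple select like the query above takes 30-60 ms. Joining two such functions takes at most 300-400 ms, which is a bit more than I expected, but still acceptable.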

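With these changes, I no longer think it is necessary to normalize feature_timeslice, i.e. to extract the valid begin/end times into a separate table, so I have not done that.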

Regarding sql - Speeding up a PostgreSQL query using timestamp OVERLAPS and "PARTITION BY", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/7495239/
