gpt4 book ai didi

sql - SELECT .. WHERE NOT IN (SELECT ..) with strings 的性能急剧下降

转载 作者:行者123 更新时间:2023-11-29 11:29:32 25 4
gpt4 key购买 nike

问题:

SELECT new_filename
FROM tmp2_import_lightnings_filenames
WHERE new_filename
NOT IN (SELECT filename FROM service.import_lightnings_filenames LIMIT 64500)
LIMIT 1;

执行时间:62 毫秒

SELECT new_filename
FROM tmp2_import_lightnings_filenames
WHERE new_filename
NOT IN (SELECT filename FROM service.import_lightnings_filenames LIMIT 65000)
LIMIT 1;

执行时间:4.742 秒

(所有LIMITS仅供测试)

巨大的滞后!而且呈指数增长。

表格:

CREATE TABLE public.tmp2_import_lightnings_filenames (
new_filename VARCHAR(63) NOT NULL,
CONSTRAINT tmp2_import_lightnings_filenames_pkey PRIMARY KEY(new_filename)
) WITHOUT OIDS;

表格大小:7304 个字符串

数据示例:/xml/2012-07-13/01/01-24.xml

CREATE TABLE service.import_lightnings_filenames (
id SERIAL,
filename VARCHAR(63) NOT NULL,
imported BOOLEAN DEFAULT false,
strokes_num INTEGER,
CONSTRAINT import_lightnings_filenames_pkey PRIMARY KEY(id)
) WITHOUT OIDS;

CREATE UNIQUE INDEX import_lightnings_filenames_idx
ON service.import_lightnings_filenames
USING btree (filename COLLATE pg_catalog."default");

表格大小:70812个字符串

数据示例:44;/xml/2012-05-26/12/12-18.xml;TRUE;NULL

查询计划:

 Limit  (cost=0.00..2108.11 rows=1 width=29) (actual time=240.183..240.183 rows=1 loops=1)
Buffers: shared hit=539, temp written=307
-> Seq Scan on tmp2_import_lightnings_filenames (cost=0.00..7698823.12 rows=3652 width=29) (actual time=240.181..240.181 rows=1 loops=1)
Filter: (NOT (SubPlan 1))
Buffers: shared hit=539, temp written=307
SubPlan 1
-> Materialize (cost=0.00..1946.82 rows=64500 width=29) (actual time=0.009..198.313 rows=64500 loops=1)
Buffers: shared hit=538, temp written=307
-> Limit (cost=0.00..1183.32 rows=64500 width=29) (actual time=0.005..113.196 rows=64500 loops=1)
Buffers: shared hit=538
-> Seq Scan on import_lightnings_filenames (cost=0.00..1299.12 rows=70812 width=29) (actual time=0.004..42.418 rows=64500 loops=1)
Buffers: shared hit=538
Total runtime: 240.982 ms



Limit (cost=0.00..2125.03 rows=1 width=29) (actual time=30734.619..30734.619 rows=1 loops=1)
Buffers: shared hit=547, temp read=112258 written=669
-> Seq Scan on tmp2_import_lightnings_filenames (cost=0.00..7760626.00 rows=3652 width=29) (actual time=30734.617..30734.617 rows=1 loops=1)
Filter: (NOT (SubPlan 1))
Buffers: shared hit=547, temp read=112258 written=669
SubPlan 1
-> Materialize (cost=0.00..1962.49 rows=65000 width=29) (actual time=0.798..42.306 rows=64820 loops=363)
Buffers: shared hit=543, temp read=112258 written=669
-> Limit (cost=0.00..1192.49 rows=65000 width=29) (actual time=0.005..116.110 rows=65000 loops=1)
Buffers: shared hit=543
-> Seq Scan on import_lightnings_filenames (cost=0.00..1299.12 rows=70812 width=29) (actual time=0.003..43.804 rows=65000 loops=1)
Buffers: shared hit=543
Total runtime: 30735.267 ms

我做错了什么?

最佳答案

性能下降的原因似乎是您用完了 work_mem materialize 步骤开始换出到磁盘。我在这里引用手册:

work_mem (integer)
[...] Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.

强调我的。通过提高 work_mem 的设置并再次运行您的查询来验证这一点。作为评论中提供的@a_horse,通过调用为当前 session 设置它:

set work_mem = '64MB';

你不需要你的系统管理员。您可以在 session 中重置为默认值:

reset work_mem;

设置将随着 session 结束而消失。更改 postgresql.conf 中的设置(并重新加载)以获得永久效果。

许多 PostgreSQL 软件包都带有非常保守的设置(默认 1MB)。这在很大程度上取决于您的工作负载,但一般来说,在 4 GB 或更大内存的机器上 16 MB 将是最小值。我在具有 12 GB RAM 的专用数据库服务器上使用 64 MB - 只有几个并发用户。

您可能需要对设置进行一些常规调整。这是 general performance optimization in the PostgreSQL Wiki 的指针列表.您还可以在链接后找到有关 work_mem 调整的更多信息。


除此之外,重写您的查询也可能会加快速度。 IN 具有大列表的子查询往往是 PostgreSQL 中最慢的选择。

左连接/为空

SELECT new_filename
FROM tmp2_import_lightnings_filenames t
LEFT JOIN (
SELECT filename
FROM service.import_lightnings_filenames
LIMIT 65000
) x ON t.new_filename = x.filename
WHERE x.filename IS NULL;

不存在

更快,尤其是在 service.import_lightnings_filenames 中有重复:

SELECT new_filename
FROM tmp2_import_lightnings_filenames t
WHERE NOT EXISTS (
SELECT 1
FROM (
SELECT filename
FROM service.import_lightnings_filenames
LIMIT 65000
) x
WHERE t.new_filename = x.filename
);

CTE 相同(可能不会更快,但更易于阅读):

WITH x AS (
SELECT filename
FROM service.import_lightnings_filenames
LIMIT 65000
)
SELECT new_filename
FROM tmp2_import_lightnings_filenames t
WHERE NOT EXISTS (
SELECT 1
FROM x
WHERE t.new_filename = x.filename
);

关于sql - SELECT .. WHERE NOT IN (SELECT ..) with strings 的性能急剧下降,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/11545707/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com