
ruby-on-rails - Optimizing a Postgres query over large amounts of data using indexes


I have a posts table with about 20 million rows. I'm trying to narrow down the posts for a paginated list with the following query:

SELECT "posts".* FROM "posts"
WHERE "posts"."source_id" IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
AND "posts"."deleted_at" IS NULL
ORDER BY external_created_at desc LIMIT 100 OFFSET 0;

(About 3.3 million rows match the source_ids in the query.)

When I run this, it takes about 60 seconds, and I get the following EXPLAIN ANALYZE (see it on depesz):

EXPLAIN ANALYZE SELECT "posts".* FROM "posts" WHERE "posts"."source_id" IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800) AND "posts"."deleted_at" IS NULL ORDER BY external_created_at desc LIMIT 100 OFFSET 0;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2530223.38..2530223.63 rows=100 width=1040) (actual time=66564.583..66564.616 rows=100 loops=1)
   ->  Sort  (cost=2530223.38..2534981.19 rows=1903125 width=1040) (actual time=66564.571..66564.594 rows=100 loops=1)
         Sort Key: external_created_at
         Sort Method: top-N heapsort  Memory: 89kB
         ->  Bitmap Heap Scan on posts  (cost=35499.76..2457487.31 rows=1903125 width=1040) (actual time=279.640..64496.330 rows=1674072 loops=1)
               Recheck Cond: ((source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])) AND (deleted_at IS NULL))
               Rows Removed by Index Recheck: 4640188
               ->  Bitmap Index Scan on index_on_posts_partial_source_id_with_order  (cost=0.00..35023.98 rows=1903125 width=0) (actual time=275.922..275.922 rows=1674072 loops=1)
                     Index Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
 Total runtime: 66564.962 ms
(10 rows)

Here is the index it's using:

CREATE INDEX index_on_posts_partial_source_id_with_order ON posts USING btree (source_id) WHERE (deleted_at IS NULL);

The Recheck Cond appears to be the slowest part of this query. Everything I've seen about recheck conditions involves increasing the memory Postgres uses because the data is "lossy", but I don't see anything like that in my query plan.
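(For reference, the memory setting those discussions usually point to is work_mem, which caps how much memory each bitmap or sort can use before the bitmap goes lossy. A minimal sketch of trying that for a single session; the 256MB value is purely illustrative, not a recommendation:)

-- Raise work_mem for the current session only (illustrative value), then
-- re-run the EXPLAIN ANALYZE above and check whether "Rows Removed by
-- Index Recheck" shrinks or lossy heap blocks disappear from the plan.
SET work_mem = '256MB';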

Any suggestions on how to speed this up?

It seems like getting rid of the recheck somehow, or somehow ordering by external_created_at, would be my best bet.
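(To illustrate the second idea, a minimal sketch only, not tested against this data, and the index name is made up: a partial index on the sort column would let Postgres scan backward in external_created_at order, filter on source_id as it goes, and stop once it has found 100 matching rows.)

-- Hypothetical partial index on the sort column; a backward index scan can
-- return rows already in external_created_at DESC order and stop after the
-- LIMIT is satisfied, instead of sorting millions of matching rows.
CREATE INDEX index_posts_on_external_created_at_partial
    ON posts USING btree (external_created_at)
    WHERE (deleted_at IS NULL);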

Edit: I'm on Postgres version 9.3.4. Here is the posts table:

CREATE TABLE posts (
id integer NOT NULL,
source_id integer,
message text,
image text,
external_id text,
created_at timestamp without time zone,
updated_at timestamp without time zone,
external text,
like_count integer DEFAULT 0 NOT NULL,
comment_count integer DEFAULT 0 NOT NULL,
external_created_at timestamp without time zone,
deleted_at timestamp without time zone,
poster_name character varying(255),
poster_image text,
poster_url character varying(255),
poster_id text,
"position" integer,
location character varying(255),
description text,
video text,
rejected_at timestamp without time zone,
deleted_by character varying(255),
height integer,
width integer
);

Best Answer

Your query is returning a few million rows for a paginated list. Think hard about the wisdom of returning data for that many pages. Also, think seriously about whether you need all the columns. I doubt that you do.
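For example (a hypothetical narrowed query; which columns the listing actually needs is an assumption here), selecting only the columns the page renders keeps the row width down:

-- Fetch only the columns the paginated list actually displays.
SELECT id, source_id, message, image, poster_name, external_created_at
FROM posts
WHERE source_id IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772,
                    15506, 14794, 15543, 31615, 15507, 15508, 14800)
  AND deleted_at IS NULL
ORDER BY external_created_at DESC
LIMIT 100 OFFSET 0;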

I built a rough test table and inserted about 10 million random(ish) rows into it. My query plan on PostgreSQL 9.4 is roughly similar to yours:

"Limit  (cost=138609.10..138609.35 rows=100 width=24) (actual time=1410.012..1410.038 rows=100 loops=1)""  ->  Sort  (cost=138609.10..140344.25 rows=694059 width=24) (actual time=1410.010..1410.026 rows=100 loops=1)""        Sort Key: external_created_at""        Sort Method: top-N heapsort  Memory: 29kB""        ->  Bitmap Heap Scan on posts  (cost=12217.47..112082.66 rows=694059 width=24) (actual time=374.393..919.687 rows=3000000 loops=1)""              Recheck Cond: ((source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])) AND (deleted_at IS NULL))""              Heap Blocks: exact=16217""              ->  Bitmap Index Scan on index_on_posts_partial_source_id_with_order  (cost=0.00..12043.95 rows=694059 width=0) (actual time=370.593..370.593 rows=3000000 loops=1)""                    Index Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))""Planning time: 0.264 ms""Execution time: 1410.097 ms"

Adding an index to external_created_at dropped the execution time by a factor of about 470. But I don't have the same distribution of values that you have.

create index on test.posts (external_created_at);
analyze test.posts;
explain analyze
select * from test.posts
where source_id in (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
and deleted_at is null
order by external_created_at desc limit 100 offset 0;
"Limit  (cost=0.43..131.43 rows=100 width=24) (actual time=0.219..2.992 rows=100 loops=1)""  ->  Index Scan Backward using posts_external_created_at_idx on posts  (cost=0.43..900991.48 rows=687808 width=24) (actual time=0.216..2.976 rows=100 loops=1)""        Filter: ((deleted_at IS NULL) AND (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])))""        Rows Removed by Filter: 350""Planning time: 0.302 ms""Execution time: 3.024 ms"

For ruby-on-rails - Optimizing a Postgres query over large amounts of data using indexes, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34483793/
