gpt4 book ai didi

PostgreSQL 使用 pg_trgm 比全扫描慢

转载 作者:行者123 更新时间:2023-11-29 14:05:42 26 4
gpt4 key购买 nike

我在玩 pg_trgm 扩展,我有点困惑。这是 session :

postgres=# create table t(i int, x text);
CREATE TABLE
postgres=# insert into t select i, random()::text from generate_series(1,50000000) as i;
INSERT 0 50000000
postgres=# explain analyze select * from t where x ilike '%666666%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..531870.29 rows=12954 width=36) (actual time=131.436..11408.176 rows=432 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on t (cost=0.00..529574.89 rows=5398 width=36) (actual time=108.771..11304.946 rows=144 loops=3)
Filter: (x ~~* '%666666%'::text)
Rows Removed by Filter: 16666523
Planning Time: 0.121 ms
Execution Time: 11408.279 ms
(8 rows)

postgres=# explain analyze select * from t where x ilike '%666666%';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..580654.94 rows=5000 width=21) (actual time=124.986..11070.983 rows=432 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on t (cost=0.00..579154.94 rows=2083 width=21) (actual time=72.207..11010.876 rows=144 loops=3)
Filter: (x ~~* '%666666%'::text)
Rows Removed by Filter: 16666523
Planning Time: 0.283 ms
Execution Time: 11071.065 ms
(8 rows)

postgres=# create index i on t using gin (x gin_trgm_ops);
CREATE INDEX
postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where x ilike '%666666%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=54.75..18107.93 rows=5000 width=21) (actual time=116.114..26995.773 rows=432 loops=1)
Recheck Cond: (x ~~* '%666666%'::text)
Rows Removed by Index Recheck: 36257910
Heap Blocks: exact=39064 lossy=230594
-> Bitmap Index Scan on i (cost=0.00..53.50 rows=5000 width=0) (actual time=75.363..75.363 rows=592216 loops=1)
Index Cond: (x ~~* '%666666%'::text)
Planning Time: 0.389 ms
Execution Time: 26996.429 ms
(8 rows)

postgres=# explain analyze select * from t where x ilike '%666666%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=54.75..18107.93 rows=5000 width=21) (actual time=128.859..29231.765 rows=432 loops=1)
Recheck Cond: (x ~~* '%666666%'::text)
Rows Removed by Index Recheck: 36257910
Heap Blocks: exact=39064 lossy=230594
-> Bitmap Index Scan on i (cost=0.00..53.50 rows=5000 width=0) (actual time=79.147..79.147 rows=592216 loops=1)
Index Cond: (x ~~* '%666666%'::text)
Planning Time: 0.252 ms
Execution Time: 29231.945 ms
(8 rows)

如您所见,在没有索引的情况下,查询比使用索引快两倍。目前只有默认的 PostgreSQL 设置(共享缓冲区、工作内存等)

我错过了什么?

PS: PostgreSQL 11.5 (Ubuntu 11.5-1.pgdg18.04+1) on x86_64-pc-linux-gnu, gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit 编译

PPS:使用 gist 索引会更慢。

最佳答案

tldr:trigrams 可能不擅长搜索由重复 N 次的单个字符组成的模式(例如 666666),因为只存在 1 个非终结符 trigram,可以 在搜索空间中出现率很高。

当使用 gin-index 时,行的位图太大而无法放入内存,因此它存储对页面的引用,数据库必须对这些页面执行进一步的重新检查扫描。如果重新检查的页面数量很少,索引的使用仍然是有益的,但是如果重新检查的页面数量很多,索引的性能就会很差。这在您的解释输出中由以下几行突出显示

   Recheck Cond: (x ~~* '%666666%'::text)
Rows Removed by Index Recheck: 36257910
Heap Blocks: exact=39064 lossy=230594

关于测试数据,此问题特定于您的搜索字符串,即 666666

如果你运行 select pg_trgm('666666'),你会发现:

        show_trgm        
-------------------------
{" 6"," 66","66 ",666}
(1 row)

甚至不会在相似的上下文中生成前 3 个三元组 (用户 jjanes 建议的更正)。搜索索引会得到所有包含 666 的页面。您可以通过使用 ... ilike '%666%' 运行 explain analyze 查询来验证这一点,并获得与上面相同的 Heap Blocks 输出。

如果您使用模式 123456 进行搜索,您会发现它的性能要好得多,因为它会生成更大的 trigram 集来进行搜索:

              show_trgm              
-------------------------------------
{" 1"," 12",123,234,345,456,"56 "}
(1 row)

在我的机器上,我得到以下信息:

|------------------------------------|
| pattern | pages rechecked |
| | exact | lossy | total |
|------------------------------------|
| 123456 | 600 | | 600 |
| 666666 | 39454 | 230592 | 270046* |
| 666 | 39454 | 230592 | 270046* |
|------------------------------------|
*this is rougly 85% of the total # of pages used for the table 't'

这是解释输出:

postgres=> explain analyze select * from t where x ~ '123456';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=90.75..18143.92 rows=5000 width=22) (actual time=110.962..113.509 rows=518 loops=1)
Recheck Cond: (x ~ '123456'::text)
Rows Removed by Index Recheck: 83
Heap Blocks: exact=600
-> Bitmap Index Scan on t_x_idx (cost=0.00..89.50 rows=5000 width=0) (actual time=110.868..110.868 rows=601 loops=1)
Index Cond: (x ~ '123456'::text)
Planning time: 0.703 ms
Execution time: 113.564 ms
(8 rows)

postgres=> explain analyze select * from t where x ~ '666666';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=54.75..18107.92 rows=5000 width=22) (actual time=137.143..18111.609 rows=462 loops=1)
Recheck Cond: (x ~ '666666'::text)
Rows Removed by Index Recheck: 36258389
Heap Blocks: exact=39454 lossy=230592
-> Bitmap Index Scan on t_x_idx (cost=0.00..53.50 rows=5000 width=0) (actual time=105.962..105.962 rows=593708 loops=1)
Index Cond: (x ~ '666666'::text)
Planning time: 0.420 ms
Execution time: 18111.739 ms
(8 rows)

postgres=> explain analyze select * from t where x ~ '666';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=54.75..18107.92 rows=5000 width=22) (actual time=102.813..17285.086 rows=593708 loops=1)
Recheck Cond: (x ~ '666'::text)
Rows Removed by Index Recheck: 35665143
Heap Blocks: exact=39454 lossy=230592
-> Bitmap Index Scan on t_x_idx (cost=0.00..53.50 rows=5000 width=0) (actual time=96.100..96.100 rows=593708 loops=1)
Index Cond: (x ~ '666'::text)
Planning time: 0.500 ms
Execution time: 17300.440 ms
(8 rows)

关于PostgreSQL 使用 pg_trgm 比全扫描慢,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/57683799/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com