gpt4 book ai didi

postgresql - PostgreSQL 中 80M 记录的正则表达式查询缓慢

转载 作者:行者123 更新时间:2023-11-29 12:17:28 28 4
gpt4 key购买 nike

我有一个包含 8000 万行的只读表:

   Column    |          Type          | Modifiers | Storage  | Stats target | Description 
-------------+------------------------+-----------+----------+--------------+-------------
id | character(11) | not null | extended | |
gender | character(1) | | extended | |
postal_code | character varying(10) | | extended | |
operator | character varying(5) | | extended | |

Indexes:
"categorised_phones_pkey" PRIMARY KEY, btree (id)
"operator_idx" btree (operator)
"postal_code_trgm_idx" gin (postal_code gin_trgm_ops)

id 是主键,包含唯一的手机号码。表格行如下所示:

      id        |     gender   |   postal_code  |   operator
----------------+--------------+----------------+------------
09567849087 | m | 7414776788 | mtn
09565649846 | f | 1268398732 | mci
09568831245 | f | 7412556443 | mtn
09469774390 | m | 5488312790 | mci

此查询第一次大约需要 65 秒,下次大约需要 8 秒:

select operator,count(*) from categorised_phones where postal_code like '1%' group by operator;

输出看起来像这样:

operator |  count  
----------+---------
mci | 4050314
mtn | 6235778

explain alanyze 的输出:

HashAggregate  (cost=1364980.61..1364980.63 rows=2 width=10) (actual time=8257.026..8257.026 rows=2 loops=1)
Group Key: operator
-> Bitmap Heap Scan on categorised_phones (cost=95969.17..1312915.34 rows=10413054 width=2) (actual time=1140.803..6332.534 rows=10286092 loops=1)
Recheck Cond: ((postal_code)::text ~~ '1%'::text)
Rows Removed by Index Recheck: 25105697
Heap Blocks: exact=50449 lossy=237243
-> Bitmap Index Scan on postal_code_trgm_idx (cost=0.00..93365.90 rows=10413054 width=0) (actual time=1129.270..1129.270 rows=10287127 loops=1)
Index Cond: ((postal_code)::text ~~ '1%'::text)
Planning time: 0.540 ms
Execution time: 8257.392 ms

我怎样才能使这个查询更快?

任何想法将不胜感激。

附言:

我正在使用 PostgreSQL 9.6.1

更新

我刚刚更新了问题。我禁用了 Parallel Query,结果发生了变化。

最佳答案

对于涉及LIKE '%start' 形式比较的查询,并遵循 PostgreSQL 自己的建议,您可以使用以下索引:

CREATE INDEX postal_code_idx  ON categorised_phones (postal_code varchar_pattern_ops) ;

有了该索引和一些模拟数据,您的执行计划很可能如下所示:

| QUERY PLAN                                                                                                                             || :------------------------------------------------------------------------------------------------------------------------------------- || HashAggregate  (cost=2368.65..2368.67 rows=2 width=12) (actual time=18.093..18.094 rows=2 loops=1)                                     ||   Group Key: operator                                                                                                                  ||   ->  Bitmap Heap Scan on categorised_phones  (cost=536.79..2265.83 rows=20564 width=4) (actual time=2.564..12.061 rows=22171 loops=1) ||         Filter: ((postal_code)::text ~~ '1%'::text)                                                                                    ||         Heap Blocks: exact=1455                                                                                                        ||         ->  Bitmap Index Scan on postal_code_idx  (cost=0.00..531.65 rows=21923 width=0) (actual time=2.386..2.386 rows=22171 loops=1) ||               Index Cond: (((postal_code)::text ~>=~ '1'::text) AND ((postal_code)::text ~<~ '2'::text))                               || Planning time: 0.119 ms                                                                                                                || Execution time: 18.122 ms                                                                                                              |

You can check it at dbfiddle here

If you have both queries with LIKE 'start%' and LIKE '%middle%', you should add this index, but keep the one already in place. Trigram indexes might prove useful with this second kind of match.


Why?

From PostgreSQL documentation on operator classes:

The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard "C" locale.

From PostgreSQL documentation on Index Types

The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE and ~ if the pattern is a constant and is anchored to the beginning of the string — for example, col LIKE 'foo%' or col ~ '^foo', but not col LIKE '%bar'. However, if your database does not use the C locale you will need to create the index with a special operator class to support indexing of pattern-matching queries; see Section 11.9 below. It is also possible to use B-tree indexes for ILIKE and ~*, but only if the pattern starts with non-alphabetic characters, i.e., characters that are not affected by upper/lower case conversion.


UPDATE

If the queries performed involved always a fix (and relatively small) number of LIKE 'x%' expressions, consider using partial indexes.

For instance, for LIKE '1%', you'd have the following index, and the following query plan (it shows about a 3x improvement):

CREATE INDEX idx_1 ON categorised_phones (operator) WHERE postal_code LIKE '1%';
VACUUM categorised_phones ;
| QUERY PLAN                                                                                                                                    || :-------------------------------------------------------------------------------------------------------------------------------------------- || GroupAggregate  (cost=0.29..658.74 rows=3 width=12) (actual time=3.235..6.493 rows=2 loops=1)                                                 ||   Group Key: operator                                                                                                                         ||   ->  Index Only Scan using idx_1 on categorised_phones  (cost=0.29..554.10 rows=20921 width=4) (actual time=0.028..3.266 rows=22290 loops=1) ||         Heap Fetches: 0                                                                                                                       || Planning time: 0.293 ms                                                                                                                       || Execution time: 6.517 ms                                                                                                                      |

关于postgresql - PostgreSQL 中 80M 记录的正则表达式查询缓慢,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/45378989/

28 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com