arrays - How to index a string array column for pg_trgm `'term' % ANY (array_column)` queries?

Reposted by: 行者123 Updated: 2023-11-29 11:19:07

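I have tried an ordinary Postgres gin index as well as the pg_trgm gin_trgm_ops and gist_trgm_ops indexes (using this workaround: https://stackoverflow.com/a/33016333/283398).

However, EXPLAIN shows a sequential scan for my query 'term' % ANY (array_column), even after running set enable_seqscan = off;.

(For my use case I need partial matching, and pg_trgm seems a better fit than full text search because my data is not linguistic. The quality of my pg_trgm results is very good.)

My use case is rows with an array column that holds a mix of first names and full names (space-separated). The search term can be a first name, a last name, or a full name (space-separated). Results of the pg_trgm % operator are case-insensitive and seem to weight matches at the beginning and end of the names in the array column heavily. That works well for full names, because it finds matching first and last names but not necessarily middle names.

https://github.com/theirix/parray_gin looks promising, but it is old and does not claim to support Postgres versions newer than 9.2.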

Best answer

Why this does not work

The index type (i.e. the operator class) gin_trgm_ops is based on the % operator, which works on two text arguments:

CREATE OPERATOR trgm.%(
  PROCEDURE = trgm.similarity_op,
  LEFTARG = text,
  RIGHTARG = text,
  COMMUTATOR = %,
  RESTRICT = contsel,
  JOIN = contjoinsel);

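You cannot use gin_trgm_ops for arrays. An index defined on an array column will never be used with any(array[...]), because the individual elements of the array are not indexed. Indexing an array requires a different type of index, namely a gin array index.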
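As a rough illustration of the difference (a sketch, not part of the original answer; the index name test_names_array_idx is made up here, and the test table is the one described in this answer), a plain gin index on the array column serves array operators such as @> and &&, which compare whole elements for equality. It therefore cannot answer a similarity search:

```sql
-- Assumption: the test table (id serial primary key, names text[]) from this answer.
-- The default gin operator class for arrays indexes whole elements.
create index test_names_array_idx on test using gin (names);

-- Can use the index: exact match on a whole element.
select count(*) from test where names @> array['praesentium'];

-- Cannot use the index: % compares trigrams of two text values,
-- and no trigrams of the individual elements are stored in it.
select count(*) from test where 'praesent' % any(names);
```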

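Fortunately, the gin_trgm_ops index has been designed so cleverly that it also works with the like and ilike operators, which can be used as an alternative solution (an example is described below).

Test table

The test table has two columns (id serial primary key, names text[]) and contains 100000 Latin sentences split into array elements.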
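The answer does not show how the table was built. A minimal sketch for reproducing the setup on a small scale could look like this, assuming pg_trgm is available for installation and using two hand-written sample rows instead of the 100000 generated ones:

```sql
-- Assumption: the extension is not installed yet in this database.
create extension if not exists pg_trgm;

create table test (
    id serial primary key,
    names text[]
);

-- Hypothetical sample rows; the original table held 100000 generated rows.
insert into test (names) values
    ('{fugiat,odio,aut,quis,dolorem}'),
    ('{praesentium,distinctio,modi,nulla}');
```

With so few rows the planner will most likely choose a sequential scan regardless of indexes, so the timings in this answer are only meaningful on a table of comparable size.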

select count(*), sum(cardinality(names))::int words from test;

 count  |  words  
--------+---------
 100000 | 1799389

select * from test limit 1;

 id |                                                     names                                                     
----+---------------------------------------------------------------------------------------------------------------
  1 | {fugiat,odio,aut,quis,dolorem,exercitationem,fugiat,voluptates,facere,error,debitis,ut,nam,et,voluptatem,eum}

Searching for the word fragment praesent returns 7051 rows in 2400 ms:

explain analyse
select count(*)
from test
where 'praesent' % any(names);

                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=5479.49..5479.50 rows=1 width=0) (actual time=2400.866..2400.866 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..5477.00 rows=996 width=0) (actual time=1.464..2400.271 rows=7051 loops=1)
         Filter: ('praesent'::text % ANY (names))
         Rows Removed by Filter: 92949
 Planning time: 1.038 ms
 Execution time: 2400.916 ms

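Materialized view

One solution is to normalize the model, which means creating a new table with a single name per row. Such a restructuring may be hard to carry out, and sometimes impossible, because of existing queries, views, functions, or other dependencies. A similar effect can be achieved with a materialized view, without changing the table structure.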

create materialized view test_names as
    select id, name, name_id
    from test
    cross join unnest(names) with ordinality u(name, name_id)
    with data;

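With ordinality is not required, but it is useful when aggregating the names in the same order as in the main table. Before an index is created, querying test_names gives the same results as the main table in about the same time.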
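For instance, the ordinality column name_id lets the original array be rebuilt from the view in its original element order. This is only a sketch of that idea, using the view defined above:

```sql
-- Rebuild each array from the unnested rows, keeping the original element order.
select id, array_agg(name order by name_id) as names
from test_names
group by id
order by id
limit 1;
```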

After creating the index, execution time drops by a large factor (from about 2400 ms to about 57 ms):

create index on test_names using gin (name gin_trgm_ops);

explain analyse
select count(distinct id)
from test_names
where 'praesent' % name;

                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4888.89..4888.90 rows=1 width=4) (actual time=56.045..56.045 rows=1 loops=1)
   ->  Bitmap Heap Scan on test_names  (cost=141.95..4884.39 rows=1799 width=4) (actual time=10.513..54.987 rows=7230 loops=1)
         Recheck Cond: ('praesent'::text % name)
         Rows Removed by Index Recheck: 7219
         Heap Blocks: exact=8122
         ->  Bitmap Index Scan on test_names_name_idx  (cost=0.00..141.50 rows=1799 width=0) (actual time=9.512..9.512 rows=14449 loops=1)
               Index Cond: ('praesent'::text % name)
 Planning time: 2.990 ms
 Execution time: 56.521 ms

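This solution has some drawbacks. Because the view is materialized, the data is stored twice in the database. You have to remember to refresh the view after the main table changes. Queries may also become more complicated, because the view has to be joined to the main table.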
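A minimal sketch of both chores, assuming the test and test_names relations defined above (the unique index is an addition that the original answer does not mention; it is needed only for the concurrent refresh):

```sql
-- Re-run after the main table changes; readers of the view are blocked while it runs.
refresh materialized view test_names;

-- Optional: with a unique index, the view can be refreshed without blocking readers.
create unique index on test_names (id, name_id);
refresh materialized view concurrently test_names;

-- Fetch whole rows of the main table whose names match the search term.
select t.*
from test t
where exists (
    select 1
    from test_names n
    where n.id = t.id
    and 'praesent' % n.name
);
```

The exists form returns each main-table row at most once, so no distinct is needed even when several names in the same array match.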

Using `ilike`

We can use ilike on the array represented as text. We need an immutable function to create an index on the whole array:

create function text(text[])
returns text language sql immutable as
$$ select $1::text $$;

create index on test using gin (text(names) gin_trgm_ops);

and use the function in the query:

explain analyse
select count(*)
from test
where text(names) ilike '%praesent%';

                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=117.06..117.07 rows=1 width=0) (actual time=60.585..60.585 rows=1 loops=1)
   ->  Bitmap Heap Scan on test  (cost=76.08..117.03 rows=10 width=0) (actual time=2.560..60.161 rows=7051 loops=1)
         Recheck Cond: (text(names) ~~* '%praesent%'::text)
         Heap Blocks: exact=2899
         ->  Bitmap Index Scan on test_text_idx  (cost=0.00..76.08 rows=10 width=0) (actual time=2.160..2.160 rows=7051 loops=1)
               Index Cond: (text(names) ~~* '%praesent%'::text)
 Planning time: 3.301 ms
 Execution time: 60.876 ms

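60 ms versus 2400 ms is quite a good result, and it requires no additional relation.

This solution looks simpler and takes less work, provided that ilike, which is a less precise tool than the trgm % operator, is sufficient.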
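To see what ilike is actually matching against, it helps to look at the text form of an array. The following is a small sketch with made-up values; the cast is the same one the text(text[]) function above performs:

```sql
-- Plain elements are joined with commas inside braces.
select array['fugiat', 'odio']::text;       -- {fugiat,odio}

-- Elements that contain spaces (such as full names) are double-quoted.
select array['john smith', 'anna']::text;   -- {"john smith",anna}

-- The pattern is matched against the whole string, so it can even span
-- the comma between two elements.
select '{fugiat,odio}' ilike '%at,od%';     -- true
```

This is one reason why ilike is less precise here: it knows nothing about element boundaries, and the braces, commas, and quotes are part of the searched text.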

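Why should we use ilike rather than % on the whole array as text? Similarity depends largely on the length of the texts. For long texts of varying length it is very hard to choose a suitable limit for a word search. For example, with limit = 0.3 we get these results: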

with data(txt) as (
values
    ('praesentium,distinctio,modi,nulla,commodi,tempore'),
    ('praesentium,distinctio,modi,nulla,commodi'),
    ('praesentium,distinctio,modi,nulla'),
    ('praesentium,distinctio,modi'),
    ('praesentium,distinctio'),
    ('praesentium')
)
select length(txt), similarity('praesent', txt), 'praesent' % txt "matched?"
from data;

 length | similarity | matched? 
--------+------------+----------
     49 |   0.166667 | f           <--!
     41 |        0.2 | f           <--!
     33 |   0.228571 | f           <--!
     27 |   0.275862 | f           <--!
     22 |   0.333333 | t
     11 |   0.615385 | t
(6 rows)
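The drop in similarity can be traced back to trigram counts: the score is the number of shared trigrams divided by the number of distinct trigrams in both texts together, so every extra word in the text enlarges the denominator. The sketch below counts them by hand for the shortest sample; it assumes the pg_trgm functions show_trgm and show_limit are reachable through the search path:

```sql
-- Trigrams of the search term and of the shortest sample text.
select show_trgm('praesent') as term_trgm, show_trgm('praesentium') as text_trgm;

-- Count shared and total distinct trigrams by hand.
with t as (
    select show_trgm('praesent') as a, show_trgm('praesentium') as b
)
select
    (select count(*) from (select unnest(a) intersect select unnest(b)) s) as shared,
    (select count(*) from (select unnest(a) union select unnest(b)) s) as total
from t;

-- The current threshold used by % (0.3 unless it has been changed).
select show_limit();
```

Counting by hand gives 8 shared trigrams out of 13 distinct ones, which agrees with the 0.615385 in the last row of the table above. Adding the word distinctio contributes 11 more trigrams and no shared ones, giving 8 / 24 = 0.333333, as in the second-to-last row.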

Regarding arrays - How to index a string array column for pg_trgm `'term' % ANY (array_column)` queries?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39480580/
