PostgreSQL, trigrams and similarity

Reposted. Author: 行者123. Updated: 2023-11-29 11:18:17

I'm just testing PostgreSQL 9.6.2 on my Mac and playing with n-grams. Assume there is a GIN trigram index on the winery field.

The similarity limit (I know this is deprecated):

SELECT set_limit(0.5);

I'm building a trigram search on a table with 2.3M rows.

My select code:

SELECT winery, similarity(winery, 'chateau chevla blanc') AS similarity 
FROM usr_wines
WHERE status=1 AND winery % 'chateau chevla blanc'
ORDER BY similarity DESC;

My results (329 ms on my Mac):

Chateau ChevL Blanc 0,85
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc, 0,736842
Chateau Blanc 0,736842
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc (7) 0,666667
Chateau Cheval Blanc Cbo 0,64
Chateau Du Cheval Blanc 0,64
Chateau Du Cheval Blanc 0,64

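Well, in this case I don't understand how "Chateau Blanc" can have a higher similarity than "Chateau Cheval Blanc". I understand that the two words "chateau" and "blanc" are exactly the same, but the other word, "cheval", is missing.

And why does "Chateau ChevL Blanc" come first? It is missing the letter "a"!

My goal is to match all possible duplicates of a winery name, even when it is misspelled. What am I missing?

Best answer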

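The concept of trigram similarity relies on splitting any sentence into "trigrams" (sequences of three consecutive letters) and treating the result as a set (i.e. order doesn't matter, and there are no repeated values). Before the sentence is considered, two spaces are added at the beginning and one at the end, and single spaces are replaced by double ones. Put another way, pg_trgm treats each word as prefixed by two spaces and suffixed by one.

Trigrams are a special case of N-grams.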

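The trigrams corresponding to "Chateau blanc" are found by taking all the sequences of three letters that appear in it:

  chateau  blanc 
---               => '  c'
 ---              => ' ch'
  ---             => 'cha'
   ---            => 'hat'
    ---           => 'ate'
     ---          => 'tea'
      ---         => 'eau'
       ---        => 'au '
        ---       => 'u  '
         ---      => '  b'
          ---     => ' bl'
           ---    => 'bla'
            ---   => 'lan'
             ---  => 'anc'
              --- => 'nc '

Sorting them and removing duplicates gives you the list below. The window 'u  ' is left out because it straddles the two words: pg_trgm pads each word on its own, so that trigram is never produced.

'  b'
'  c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'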

PostgreSQL can compute this with the function show_trgm:

SELECT show_trgm('Chateau blanc') AS A

A = ['  b', '  c', ' bl', ' ch', 'anc', 'ate', 'au ', 'bla', 'cha', 'eau', 'hat', 'lan', 'nc ', 'tea']

... which has 14 trigrams. (See the pg_trgm documentation.)
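The pad-and-slide procedure described above can also be sketched outside the database. The Python function below is only an approximation of what show_trgm does for plain ASCII input, not pg_trgm's actual implementation: it assumes that words are runs of letters and digits, and it ignores locale and multibyte handling. The name show_trgm is reused purely for illustration.

```python
import re


def show_trgm(text):
    """Approximate pg_trgm's show_trgm for plain ASCII text (a sketch)."""
    trigrams = set()
    # Lower-case the input, then treat every run of letters/digits as a word.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Each word gets two spaces in front and one space behind.
        padded = "  " + word + " "
        # Slide a three-character window over the padded word.
        for i in range(len(padded) - 2):
            trigrams.add(padded[i:i + 3])
    # A set has no duplicates; sort only to make the output readable.
    return sorted(trigrams)


print(show_trgm("Chateau blanc"))
print(len(show_trgm("Chateau blanc")))
```

Under these assumptions, 'Chateau blanc' yields the same 14 trigrams that the query returns.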

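And the set of trigrams corresponding to "chateau chevla blanc", the search term from the question, is:

SELECT show_trgm('chateau chevla blanc') AS B 

B = ['  b', '  c', ' bl', ' ch', 'anc', 'ate', 'au ', 'bla', 'cha', 'che', 'eau', 'evl', 'hat', 'hev', 'la ', 'lan', 'nc ', 'tea', 'vla']

... which has 19 trigrams.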

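If you count the trigrams the two sets have in common, you find the following:

A intersect B = 
['  b', '  c', ' bl', ' ch', 'anc', 'ate', 'au ', 'bla', 'cha', 'eau', 'hat', 'lan', 'nc ', 'tea']

And all the trigrams that appear in either set are:

A union B = 
['  b', '  c', ' bl', ' ch', 'anc', 'ate', 'au ', 'bla', 'cha', 'che', 'eau', 'evl', 'hat', 'hev', 'la ', 'lan', 'nc ', 'tea', 'vla']

That is, the two sentences have 14 trigrams in common and 19 in total.
The similarity is computed as: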

 similarity = 14 / 19

You can check it with:

SELECT 
cast(14.0/19.0 as real) AS computed_result,
similarity('Chateau blanc', 'chateau chevla blanc') AS function_in_pg ;

You will see that you get: 0.736842

... which explains how the similarity is computed, and why you get the values you get.
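The 14 / 19 arithmetic can be reproduced without a database. What follows is a minimal sketch in Python, assuming plain ASCII input and words made of letters and digits. The helpers trigram_set and similarity are illustrative names, not part of pg_trgm, and they only approximate its behavior.

```python
import re


def trigram_set(text):
    """Approximate pg_trgm trigram extraction for plain ASCII text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Pad each word with two leading spaces and one trailing space,
    # then collect every three-character window into a set.
    return {
        ("  " + w + " ")[i:i + 3]
        for w in words
        for i in range(len(w) + 1)
    }


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigram_set(a), trigram_set(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


a = trigram_set("Chateau blanc")
b = trigram_set("chateau chevla blanc")
print(len(a & b), len(a | b))  # 14 19
print(round(similarity("Chateau blanc", "chateau chevla blanc"), 6))  # 0.736842
```

The missing word "cheval" costs only the five trigrams that it alone contributes to the union. Everything coming from "chateau" and "blanc" is shared, which is why the score stays as high as it does.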


Note: you can compute the intersection and the union with:

SELECT 
    array_agg(t) AS in_common
FROM
    (
        SELECT unnest(show_trgm('Chateau blanc')) AS t
        INTERSECT
        SELECT unnest(show_trgm('chateau chevla blanc')) AS t
        ORDER BY t
    ) AS trigrams_in_common ;

SELECT
    array_agg(t) AS in_total
FROM
    (
        SELECT unnest(show_trgm('Chateau blanc')) AS t
        UNION
        SELECT unnest(show_trgm('chateau chevla blanc')) AS t
    ) AS trigrams_in_total ;

And this is a way to explore the similarity of different pairs of sentences:

WITH p AS
(
    SELECT
        'This is just a sentence I''ve invented'::text AS f1,
        'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS
(
    SELECT unnest(show_trgm(f1)) FROM p
),
t2 AS
(
    SELECT unnest(show_trgm(f2)) FROM p
),
x AS
(
    SELECT
        (SELECT count(*) FROM
            (SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
        (SELECT count(*) FROM
            (SELECT * FROM t1 UNION SELECT * FROM t2) AS s0)::integer AS total,
        similarity(f1, f2) AS sim_2
    FROM
        p
)
SELECT
    same, total, same::real/total::real AS sim_1, sim_2
FROM
    x ;

You can check it at Rextester.
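The same set arithmetic also accounts for the ordering the question asks about. The sketch below re-implements the similarity in plain Python and ranks the distinct names from the question's result against the search term. It is an approximation for ASCII text, and trigram_set and similarity are illustrative helpers rather than pg_trgm functions.

```python
import re


def trigram_set(text):
    """Approximate pg_trgm trigram extraction for plain ASCII text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {
        ("  " + w + " ")[i:i + 3]
        for w in words
        for i in range(len(w) + 1)
    }


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigram_set(a), trigram_set(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


query = "chateau chevla blanc"
# Distinct winery names taken from the result set in the question.
names = [
    "Chateau Cheval Blanc",
    "Chateau Du Cheval Blanc",
    "Chateau Blanc",
    "Chateau ChevL Blanc",
    "Chateau Cheval Blanc (7)",
    "Chateau Cheval Blanc Cbo",
]

# Highest similarity first, like ORDER BY similarity DESC.
ranked = sorted(names, key=lambda n: similarity(n, query), reverse=True)
for name in ranked:
    shared = len(trigram_set(name) & trigram_set(query))
    total = len(trigram_set(name) | trigram_set(query))
    print(f"{similarity(name, query):.6f}  {shared}/{total}  {name}")
```

Under these assumptions "Chateau ChevL Blanc" shares 17 of 20 trigrams with the search term (0.85), while "Chateau Cheval Blanc" shares only 16 of 22 (0.727273). Swapping "al" for "la" replaces three trigrams on each side, so six differ in total, whereas "chevl" and "chevla" differ in only three.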

Regarding PostgreSQL, trigrams and similarity, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43156987/
