
lucene - Exact document matching with ElasticSearch

Reposted. Author: 行者123. Updated: 2023-12-02 07:36:39

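I need to run exact queries against a set of "short documents". Example:

Documents:

  1. {"name": "John Doe", "alt": "John W Doe"}
  2. {"name": "My friend John Doe", "alt": "John A Doe"}
  3. {"name": "John", "alt": "Susy"}
  4. {"name": "Jack", "alt": "John Doe"}

Expected results:

  1. If I search for "John Doe", I want the score of 1 to be much greater than the scores of 2 and 4
  2. If I search for "John Doé", same as above
  3. If I search for "John", I want to get 3 (an exact match is better than the name repeated in both name and alt)

Is this possible with ES? How can I achieve it? I tried boosting "name", but I can't find out how to match a document field exactly, instead of searching inside it.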

Best answer

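What you're describing is exactly how a search engine works by default. A search for "john doe" becomes a search for the terms "john" and "doe". For each term, it looks for documents that contain the term, then assigns a _score to each document based on:

  • how common the term is across all documents (more common == less relevant)
  • how common the term is within the document's field (more common == more relevant)
  • how long the document's field is (longer == less relevant)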
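The three factors above can be sketched in a few lines of Python. This is only an illustration, assuming the classic Lucene TF/IDF formulas (square-root term frequency, logarithmic inverse document frequency, 1/sqrt(length) field norm, plus query normalization and a coordination factor). The names tokenize, idf, score and rank are made up for this sketch, and the whitespace tokenizer is a crude stand-in for a real analyzer.

```python
import math

# The "name" field of the four documents from the question.
DOCS = {
    1: "John Doe",
    2: "My friend John Doe",
    3: "John",
    4: "Jack",
}

def tokenize(text):
    """Crude stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def idf(term, docs):
    """Terms found in many documents get a lower weight."""
    doc_freq = sum(1 for text in docs.values() if term in tokenize(text))
    return 1.0 + math.log(len(docs) / (doc_freq + 1))

def score(query, doc_id, docs):
    """TF/IDF-style score of one document for a (possibly multi-term) query."""
    doc_terms = tokenize(docs[doc_id])
    query_terms = tokenize(query)
    # Longer fields are considered less relevant.
    field_norm = 1.0 / math.sqrt(len(doc_terms))
    # Keeps scores of different queries on a comparable scale.
    query_norm = 1.0 / math.sqrt(sum(idf(t, docs) ** 2 for t in query_terms))
    total, matched = 0.0, 0
    for term in query_terms:
        freq = doc_terms.count(term)
        if freq == 0:
            continue
        matched += 1
        # A term repeated within the field is considered more relevant.
        tf = math.sqrt(freq)
        weight = idf(term, docs)
        total += (weight * query_norm) * (tf * weight * field_norm)
    # Coordination factor: reward documents matching more of the query terms.
    return total * matched / len(query_terms)

def rank(query, docs=DOCS):
    """Return (doc_id, score) pairs for matching documents, best first."""
    scored = [(doc_id, score(query, doc_id, docs)) for doc_id in docs]
    hits = [(doc_id, s) for doc_id, s in scored if s > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

for query in ("john", "john doe"):
    print(query, rank(query))
```

With these assumptions, "john" ranks doc 3 first and "john doe" ranks doc 1 first, which is the ordering the question asks for. The absolute numbers will not match Elasticsearch exactly in every case, presumably because Lucene stores field-length norms in a lossy compact form, which this sketch does not model.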

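The reason you're not seeing clear results is that Elasticsearch is distributed, and you're testing with a small amount of data. By default (in Elasticsearch versions before 7.0), an index has 5 primary shards, and your documents are indexed on different shards. Each shard keeps its own document frequency counts, so the scores are distorted.

When you add real-world amounts of data, the frequencies even themselves out across shards, but to test with a small amount of data you need to do one of two things:

  1. create an index with only one primary shard, or
  2. specify search_type=dfs_query_then_fetch, which first fetches the frequencies from each shard and then runs the query using the global frequencies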
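To see why per-shard frequencies distort scores on tiny data sets, here is a small Python sketch. The split of the four documents across two shards is hypothetical (Elasticsearch decides the real placement by routing), and idf, local_idfs and global_idf are illustrative names. The idf formula is the same assumed classic TF/IDF formula as elsewhere in this answer.

```python
import math

def tokenize(text):
    """Crude stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def idf(term, docs):
    """Inverse document frequency over one collection of field values."""
    doc_freq = sum(1 for text in docs if term in tokenize(text))
    return 1.0 + math.log(len(docs) / (doc_freq + 1))

# Hypothetical placement of the four "name" values on two shards.
SHARDS = {
    "shard_a": ["John Doe", "Jack"],
    "shard_b": ["My friend John Doe", "John"],
}

def local_idfs(term, shards=SHARDS):
    """What each shard computes on its own (default query_then_fetch)."""
    return {name: idf(term, docs) for name, docs in shards.items()}

def global_idf(term, shards=SHARDS):
    """What every shard uses once frequencies are collected up front (dfs_query_then_fetch)."""
    all_docs = [text for docs in shards.values() for text in docs]
    return idf(term, all_docs)

print("local :", local_idfs("john"))
print("global:", global_idf("john"))
```

In this made-up layout, the same term "john" gets a different weight on each shard, so documents on different shards are scored on different scales. With global frequencies, or with a single shard, every document is weighted the same way.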

To demonstrate, first index your data:

curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1'  -d '
{
"alt" : "John W Doe",
"name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1' -d '
{
"alt" : "John A Doe",
"name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1' -d '
{
"alt" : "Susy",
"name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1' -d '
{
"alt" : "John Doe",
"name" : "Jack"
}
'

Now, search for "john doe", remembering to specify dfs_query_then_fetch:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
{
"query" : {
"match" : {
"name" : "john doe"
}
}
}
'

Doc 1 is the first result:

# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 1.0189849,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.81518793,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 0.3066778,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# }
# ],
# "max_score" : 1.0189849,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 8
# }

When you search for just "john":

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
{
"query" : {
"match" : {
"name" : "john"
}
}
}
'

Doc 3 comes first:

# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 1,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 0.625,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.5,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# }
# ],
# "max_score" : 1,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 5
# }

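Ignoring accents

The second issue is matching "John Doé". This is a matter of analysis. To make full text more searchable, we analyze it into separate terms, or tokens, which are what gets stored in the index. To match john, John and JOHN when the user searches for john, each term/token is passed through a number of token filters that put it into a standard form.

When we run a full-text search, the search terms go through exactly the same process. So if we have a document containing John, it is indexed as john, and if a user searches for JOHN, we actually search for john.

To make Doé match doe, we need a token filter that removes accents, and we need to apply it both to the text being indexed and to the search terms. The simplest way to do this is to use the ASCII folding token filter.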
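As a rough illustration of what such an analysis chain does, the following Python sketch lowercases each token and strips accents using Unicode decomposition. It only approximates the idea. The real standard tokenizer and asciifolding filter handle many more cases, and fold_to_ascii and analyze are hypothetical helper names, not Elasticsearch APIs.

```python
import unicodedata

def fold_to_ascii(token):
    """Drop combining accent marks after decomposing, e.g. 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def analyze(text):
    """Split on whitespace, then lowercase and fold each token."""
    return [fold_to_ascii(token.lower()) for token in text.split()]

# The same function is applied to indexed text and to search terms,
# so both sides end up in the same standard form.
indexed_terms = analyze("John Doe")
query_terms = analyze("John Doé")
print(indexed_terms, query_terms, indexed_terms == query_terms)
```

Because both sides go through the same steps, "John Doé" and "John Doe" produce identical terms and therefore match.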

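We can define a custom analyzer when we create the index, and in the mapping we can specify that a particular field should use that analyzer at both index time and search time.

First, delete the old index: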

curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1' 

Then create the index, specifying the custom analyzer and the mapping:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"no_accents" : {
"filter" : [
"standard",
"lowercase",
"asciifolding"
],
"type" : "custom",
"tokenizer" : "standard"
}
}
}
},
"mappings" : {
"test" : {
"properties" : {
"name" : {
"type" : "string",
"analyzer" : "no_accents"
}
}
}
}
}
'

Reindex the data:

curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1'  -d '
{
"alt" : "John W Doe",
"name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1' -d '
{
"alt" : "John A Doe",
"name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1' -d '
{
"alt" : "Susy",
"name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1' -d '
{
"alt" : "John Doe",
"name" : "Jack"
}
'

Now, test the search:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
{
"query" : {
"match" : {
"name" : "john doé"
}
}
}
'

# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 1.0189849,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.81518793,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 0.3066778,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# }
# ],
# "max_score" : 1.0189849,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 6
# }

Regarding "lucene - Exact document matching with ElasticSearch", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15547349/
