search - nGram partial matching & limiting nGram results in multiple field query

Reposted by: 行者123. Updated: 2023-12-03 01:51:37

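Background: I've implemented partial search on a name field by indexing the tokenized name (the name field) as well as a trigram-analyzed name (the ngram field).
I've boosted the name field so that exact token matches bubble up to the top of the results.
Problem: I'm trying to implement a query that limits nGram matches to those that match at least some threshold (say 80%) of the query string. I understand that minimum_should_match seems to be what I'm looking for, but my problem is forming a query that actually produces those results.
My exact token matches are boosted to the top, but I still get every document that has even a single matching trigram in the ngram field.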
GIST: Index settings and mapping
Index settings

{
  "my_index": {
    "settings": {
      "index": {
        "number_of_shards": "5",
        "max_result_window": "30000",
        "creation_date": "1475853851937",
        "analysis": {
          "filter": {
            "ngram_filter": {
              "type": "ngram",
              "min_gram": "3",
              "max_gram": "3"
            }
          },
          "analyzer": {
            "ngram_analyzer": {
              "filter": [
                "lowercase",
                "ngram_filter"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "AuCjcP5sSb-m59bYrprFcw",
        "version": {
          "created": "2030599"
        }
      }
    }
  }
}

Index mapping

{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "acw": {
            "type": "integer"
          },
          "pcg": {
            "type": "integer"
          },
          "date": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "dob": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "id": {
            "type": "string"
          },
          "name": {
            "type": "string",
            "boost": 10
          },
          "ngram": {
            "type": "string",
            "analyzer": "ngram_analyzer"
          },
          "bdk": {
            "type": "integer"
          },
          "mmw": {
            "type": "integer"
          },
          "mpi": {
            "type": "integer"
          },
          "sex": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}

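Attempted solutions
[GIST: query attempts] unlinked because of the 2-link limit :( (https://gist.github.com/jordancardwell/2e690013666e7e1da6ef1acee314b4e6)
I tried a multi-match query, which gives me the correct search results, but I've had no luck excluding results for names that match only a single trigram (for example, a long name whose only match is the "odo" trigram).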

//this matches 'frodo' and sends results to the top, since `name` field is boosted
//  but also matches 'theodore' and 'rodolpho'

{
  "size":100,
  "from":0,
  "query":{
    "multi_match":{
      "query":"frodo",
      "fields":[
        "name",
        "ngram"
      ],
      "type":"best_fields"
    }
  }
}

//I then tried to throw in the `minimum_should_match` option
// hoping it would filter out large strings that only had one matching trigram for instance
{
  "size":100,
  "from":0,
  "query":{
    "multi_match":{
      "query":"frodo",
      "fields":[
        "name",
        "ngram"
      ],
      "type":"best_fields",
      "minimum_should_match": "90%"
    }
  }
}

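I also tried, in a sense, to manually write out the match queries that this expands to, so that I could apply minimum_should_match to the ngram field only, but I can't seem to get the syntax right.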

// I then tried to construct a custom query to just return the `minimum_should_match`d results on the ngram field
// I started with a query produced by using bodybuilder to `and` and `or` my other search criteria together
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            //each separate field's criteria `must`/`and`ed together
            {
              "query": {
                "bool": {
                  "filter": {
                    "bool": {
                      "should": [
                        //each criterion for a specific field `should`/`or`ed together
                        {
                         //my attempt at getting `ngram` field results.. 
                         // should theoretically only return when field 
                         // contains nothing but matching ngrams 
                         // (i.e. exact matches and other fluke matches)
                          "query": { 
                            "match": {
                              "ngram": {
                                "query": "frodo",
                                "minimum_should_match": "100%"
                              }
                            }
                          }
                        }
                        //... other criteria to be `should`/`or`ed together
                      ]
                    }
                  }
                }
              }
            }
            //... other criteria to be `must`/`and`ed together
          ]
        }
      }
    }
  }
}

Can anyone see what I'm doing wrong?
It seems like this should be easy to accomplish, but I must be missing something obvious.

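Update
I ran a query with _explain=true (using the Sense UI) to try to understand my results.
I ran a match query on the ngram field for "frod" with minimum_should_match = 100%, but I still get every record that matches at least one ngram
(for example rodolpho, even though it does not contain fro).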
GIST: test query and results

Note: cross-posted from [discuss.elastic.co]
I'll turn this into a link later, I can't post more than 2 yet :/ (https://discuss.elastic.co/t/ngram-partial-match-limiting-ngram-results-in-multiple-field-query/62526)

Best answer

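I used your settings and mappings to create an index. Your queries seem to work fine for me. I'd suggest running an explain on one of the "unexpected" documents that is being returned, to see why it matched and came back with the other results.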

Here is what I did:

Run the analyze API against your analyzer to see how the query will be split into tokens.

curl -XGET 'localhost:9200/my_index/_analyze' -d '
{
  "analyzer" : "ngram_analyzer",
  "text" : "frodo"
}'

frodo will be split into 3 tokens by your analyzer.

{
  "tokens": [
    {
      "token": "fro",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "rod",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "odo",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}
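As a rough cross-check of that output, the analyzer chain can be approximated in plain Python. This is a sketch under assumptions: `ngram_analyze` is a hypothetical helper, and the `\w+` split is only a stand-in for the standard tokenizer, which handles punctuation and Unicode more carefully.

```python
import re


def ngram_analyze(text, n=3):
    """Approximate ngram_analyzer: split into words, lowercase, emit n-grams."""
    tokens = []
    # Assumption: \w+ roughly imitates the standard tokenizer for simple names.
    for word in re.findall(r"\w+", text.lower()):
        # Words shorter than n yield no grams (min_gram = max_gram = 3).
        tokens.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return tokens


print(ngram_analyze("frodo"))  # ['fro', 'rod', 'odo']
```

Because no separate search analyzer is configured in the mapping, the same chain is applied to the query string, which is why a search for frodo becomes three optional term clauses on the ngram field.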

I indexed 3 documents for testing (using only the ngram field). Here are the documents:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "ngram": "theodore"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "ngram": "frodo"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "ngram": "rudolpho"
        }
      }
    ]
  }
}

The first query you mentioned matches frodo and theodore, but not rudolpho as you mentioned. That makes sense, because rudolpho does not produce any trigram that matches a trigram from frodo:

frodo -> fro, rod, odo 

rudolpho -> rud, udo, dol, olp, lph, pho
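The overlap can be checked by hand with a few lines of Python. This is only a sketch: `trigrams` and `shared_trigrams` are illustrative helpers, not part of Elasticsearch, and they assume single-word, lowercase-able names. Note that the test document here is spelled rudolpho, while the question's query comment mentions rodolpho, which behaves differently.

```python
def trigrams(word):
    """Set of 3-character substrings of a single lowercased word."""
    word = word.lower()
    return {word[i:i + 3] for i in range(len(word) - 2)}


def shared_trigrams(query, name):
    """Trigrams that the query and the indexed name have in common."""
    return sorted(trigrams(query) & trigrams(name))


for name in ("frodo", "theodore", "rudolpho", "rodolpho"):
    print(name, shared_trigrams("frodo", name))
# frodo ['fro', 'odo', 'rod']
# theodore ['odo']
# rudolpho []
# rodolpho ['odo', 'rod']
```

So theodore shares one trigram with frodo, rudolpho shares none, and the rodolpho spelling from the question would share two.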

With your second query, I get back only frodo (neither of the other two).

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.53148466,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.53148466,
        "_source": {
          "ngram": "frodo"
        }
      }
    ]
  }
}
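A small simulation shows why only frodo survives the 90% threshold. It is a sketch, not Elasticsearch code: `required_matches` and `matches` are hypothetical helpers, the percentage is assumed to be a positive integer, and duplicate trigrams in the query are ignored by using sets.

```python
def trigrams(word):
    """Set of 3-character substrings of a single lowercased word."""
    word = word.lower()
    return {word[i:i + 3] for i in range(len(word) - 2)}


def required_matches(clause_count, percent):
    """Clauses that must match for a positive percentage, rounded down."""
    return (clause_count * percent) // 100


def matches(query, doc, percent):
    """True if enough of the query's trigrams occur in the document."""
    query_grams = trigrams(query)
    # A document with no matching trigram is never a hit, hence max(1, ...).
    needed = max(1, required_matches(len(query_grams), percent))
    return len(query_grams & trigrams(doc)) >= needed


print(required_matches(3, 90))  # 2
for doc in ("frodo", "theodore", "rudolpho"):
    print(doc, matches("frodo", doc, 90))
# frodo True
# theodore False
# rudolpho False
```

Under the same assumptions, `matches("frod", "rodolpho", 100)` is False, since frod yields fro and rod and rodolpho contains only rod. That is the behaviour the question's update expected to see.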

Then I ran an explain (localhost:9200/my_index/my_type/2/_explain) on the other two documents (theodore and rudolpho), and this is what I saw (I've trimmed the response):

{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "2",
  "matched": false,
  "explanation": {
    "value": 0,
    "description": "Failure to meet condition(s) of required/prohibited clause(s)",
    "details": [
      {
        "value": 0,
        "description": "no match on required clause ((ngram:fro ngram:rod ngram:odo)~2)",
        "details": [

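The above is expected, because at least two of the three tokens from frodo should match: 90% of 3 clauses is 2.7, which is rounded down to 2 (the ~2 in the clause above), and theodore matches only one of them (odo).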
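If the threshold should apply to the ngram field only, as the question attempts, one possible shape is a bool query with two should clauses, where only the ngram clause carries minimum_should_match. The sketch below just builds and prints the request body. It is not taken from the answer above and has not been run against a cluster; `build_query` and its parameter names are made up for illustration.

```python
import json


def build_query(text, ngram_threshold="80%", size=100):
    """Build a search body: exact-token match on name OR thresholded match on ngram."""
    return {
        "size": size,
        "from": 0,
        "query": {
            "bool": {
                "should": [
                    # Exact token match; the mapping already boosts this field.
                    {"match": {"name": {"query": text}}},
                    # Trigram match, kept only if enough trigrams are present.
                    {
                        "match": {
                            "ngram": {
                                "query": text,
                                "minimum_should_match": ngram_threshold,
                            }
                        }
                    },
                ]
            }
        },
    }


print(json.dumps(build_query("frodo"), indent=2))
```

Keeping the two match clauses separate, rather than using multi_match, means the percentage is evaluated against the trigram clauses alone and never against the single-token name clause.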

Regarding "search - nGram partial matching & limiting nGram results in multiple field query", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39924784/
