
search - nGram partial matching and limiting nGram results in a multi-field query

Repost. Author: 行者123. Updated: 2023-12-03 01:51:37

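Background: I have implemented partial search on a name field by indexing the tokenized name (the name field) as well as a trigram-analyzed name (the ngram field).
I have boosted the name field so that exact token matches bubble up to the top of the results.
Problem: I am trying to implement a query that restricts nGram matches to only those that match a certain threshold (say 80%) of the query string. I understand that minimum_should_match seems to be what I'm looking for, but my problem is forming a query that actually produces those results.
My exact token matches are boosted to the top, but I am still getting every document that has a single matching trigram in the ngram field.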
GIST: Index settings and mapping
Index settings

{
"my_index": {
"settings": {
"index": {
"number_of_shards": "5",
"max_result_window": "30000",
"creation_date": "1475853851937",
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": "3",
"max_gram": "3"
}
},
"analyzer": {
"ngram_analyzer": {
"filter": [
"lowercase",
"ngram_filter"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "AuCjcP5sSb-m59bYrprFcw",
"version": {
"created": "2030599"
}
}
}
}
}
Index mapping
{
"my_index": {
"mappings": {
"my_type": {
"properties": {
"acw": {
"type": "integer"
},
"pcg": {
"type": "integer"
},
"date": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"dob": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"id": {
"type": "string"
},
"name": {
"type": "string",
"boost": 10
},
"ngram": {
"type": "string",
"analyzer": "ngram_analyzer"
},
"bdk": {
"type": "integer"
},
"mmw": {
"type": "integer"
},
"mpi": {
"type": "integer"
},
"sex": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
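Attempted solutions
[GIST: query attempts] unlinked due to the 2-link limit :( (https://gist.github.com/jordancardwell/2e690013666e7e1da6ef1acee314b4e6)
I tried a multi-match query, which gives me proper search results, but I haven't had any luck omitting the results for names that match on only a single trigram (for example, the "odo" trigram in a name like "…odophilus").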
//this matches 'frodo' and sends results to the top, since `name` field is boosted
// but also matches 'theodore' and 'rodolpho'

{
"size":100,
"from":0,
"query":{
"multi_match":{
"query":"frodo",
"fields":[
"name",
"ngram"
],
"type":"best_fields"
}
}
}
// I then tried to throw in the `minimum_should_match` option,
// hoping it would filter out large strings that only had one matching trigram, for instance
{
"size":100,
"from":0,
"query":{
"multi_match":{
"query":"frodo",
"fields":[
"name",
"ngram"
],
"type":"best_fields",
"minimum_should_match": "90%"
}
}
}
I also tried to write out by hand the match queries that this would generate, so that I could apply minimum_should_match to the ngram field only, but I can't seem to get the syntax right.
// I then tried to construct a custom query to just return the `minimum_should_match`d results on the ngram field
// I started with a query produced by using bodybuilder to `and` and `or` my other search criteria together
{
"query": {
"bool": {
"filter": {
"bool": {
"must": [
//each separate field's criteria `must`/`and`ed together
{
"query": {
"bool": {
"filter": {
"bool": {
"should": [
//each criterion for a specific field `should`/`or`ed together
{
//my attempt at getting `ngram` field results..
// should theoretically only return when field
// contains nothing but matching ngrams
// (i.e. exact matches and other fluke matches)
"query": {
"match": {
"ngram": {
"query": "frodo",
"minimum_should_match": "100%"
}
}
}
}
//... other criteria to be `should`/`or`ed together
]
}
}
}
}
}
//... other criteria to be `must`/`and`ed together
]
}
}
}
}
}
Can anyone see what I'm doing wrong?
It seems like this should be fairly simple to accomplish, but I must be missing something obvious.

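Update
I ran a query with _explain=true (using the Sense UI) to try to understand my results.
I ran a match query on the ngram field for "frod" with minimum_should_match = 100%, but I still get every record that matches at least one ngram
(e.g. rodolpho, even though it does not contain fro).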
GIST: test query and results

Note: cross-posted from [discuss.elastic.co]
I will make it a link later, I can't post more than 2 yet :/ (https://discuss.elastic.co/t/ngram-partial-match-limiting-ngram-results-in-multiple-field-query/62526)

Best answer

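I created an index using your settings and mapping. Your queries seem to work fine for me. I would suggest running an explain on one of the "unexpected" documents being returned, to see why it matched and came back with the other results.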

Here is what I did:

Run the analyze API against your analyzer to see how the query will be split into tokens.

curl -XGET 'localhost:9200/my_index/_analyze' -d '
{
"analyzer" : "ngram_analyzer",
"text" : "frodo"
}'

frodo will be split into 3 tokens by your analyzer.
{
"tokens": [
{
"token": "fro",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "rod",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "odo",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
}
]
}
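
The tokenization shown above can be approximated in plain Python, which makes it easy to check other names without a running cluster. This is only a rough sketch: `ngram_analyze` is a hypothetical helper, and the `\w+` regex is an assumed stand-in for the standard tokenizer, which follows more detailed Unicode rules.

```python
import re

def ngram_analyze(text, n=3):
    """Approximate the question's ngram_analyzer: split words, lowercase, emit n-grams."""
    tokens = []
    # Assumption: \w+ roughly imitates the standard tokenizer for simple names.
    for word in re.findall(r"\w+", text):
        word = word.lower()
        # Words shorter than n produce no grams at all.
        tokens.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return tokens

print(ngram_analyze("frodo"))  # ['fro', 'rod', 'odo']
```

Because min_gram and max_gram are both 3, a word of length k yields k - 2 trigrams, so longer names carry more chances of a single accidental overlap.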

I indexed 3 documents for testing (using only the ngram field). Here are the documents:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "2",
"_score": 1,
"_source": {
"ngram": "theodore"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_score": 1,
"_source": {
"ngram": "frodo"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "3",
"_score": 1,
"_source": {
"ngram": "rudolpho"
}
}
]
}
}

The first query you mentioned matches frodo and theodore, but not rudolpho as you described. That makes sense, because rudolpho does not produce any trigram that matches a trigram from frodo:
frodo -> fro, rod, odo 

rudolpho -> rud, udo, dol, olp, lph, pho
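
The same comparison can be done for all three test documents at once. The sketch below is illustrative only; `trigrams` and `shared_trigrams` are hypothetical helpers that ignore tokenization and lowercasing, which is enough for these single lowercase words.

```python
def trigrams(word):
    """All 3-character substrings of a word, in order."""
    return [word[i:i + 3] for i in range(len(word) - 2)]

def shared_trigrams(query, name):
    """Trigrams of the query that also occur in the indexed name."""
    indexed = set(trigrams(name))
    return [t for t in trigrams(query) if t in indexed]

for name in ["frodo", "theodore", "rudolpho"]:
    print(name, shared_trigrams("frodo", name))
# frodo ['fro', 'rod', 'odo']
# theodore ['odo']
# rudolpho []
```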

With your second query, I get back only frodo (neither of the other two).
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.53148466,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_score": 0.53148466,
"_source": {
"ngram": "frodo"
}
}
]
}
}
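
The question also asks how to apply the threshold to the ngram field only. One possible shape, not taken from the answer, is a bool query with one match clause per field, where only the ngram clause carries minimum_should_match. The sketch below just builds the request body; `build_name_query` is a hypothetical helper, and the 80% default is the example threshold from the question.

```python
import json

def build_name_query(text, threshold="80%"):
    """Build a search body that limits only the ngram clause by minimum_should_match."""
    return {
        "size": 100,
        "from": 0,
        "query": {
            "bool": {
                # With no must/filter clauses, at least one should clause has to match.
                "should": [
                    # Exact token match on the boosted name field.
                    {"match": {"name": text}},
                    # Trigram match, kept only if enough trigrams overlap.
                    {"match": {"ngram": {"query": text,
                                         "minimum_should_match": threshold}}},
                ]
            }
        },
    }

print(json.dumps(build_name_query("frodo"), indent=2))
```

The printed JSON can be sent as the body of a search request against my_index. Whether it ranks results the way the question wants would still need to be checked with explain.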

I then ran an explain on the other two documents, theodore and rudolpho (localhost:9200/my_index/my_type/2/_explain), and saw this (response trimmed):
{
"_index": "my_index",
"_type": "my_type",
"_id": "2",
"matched": false,
"explanation": {
"value": 0,
"description": "Failure to meet condition(s) of required/prohibited clause(s)",
"details": [
{
"value": 0,
"description": "no match on required clause ((ngram:fro ngram:rod ngram:odo)~2)",
"details": [

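This is expected, because at least two of the three tokens from frodo must match: 90% of 3 clauses is 2.7, which is rounded down to 2 (the ~2 in the explanation). theodore shares only odo with frodo, and rudolpho shares nothing.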
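
The arithmetic behind the ~2 in the explanation can be sketched as follows. This covers only positive percentages, where the computed count is rounded down; `required_clauses` is a hypothetical helper, and the shared-trigram counts are the ones for the three test documents above.

```python
def required_clauses(optional_clauses, percent):
    """Minimum number of optional clauses that must match for a positive percentage."""
    # Integer division rounds the computed count down, e.g. 90% of 3 -> 2.
    return optional_clauses * percent // 100

# Number of frodo's trigrams (fro, rod, odo) found in each test document.
shared = {"frodo": 3, "theodore": 1, "rudolpho": 0}

needed = required_clauses(3, 90)
for name, count in shared.items():
    print(name, count >= needed)
# frodo True
# theodore False
# rudolpho False
```

With a three-trigram query, 80% and 90% both come out as 2 required matches, and only 100% requires all three.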

Regarding "search - nGram partial matching and limiting nGram results in a multi-field query", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/39924784/
