elasticsearch - Query using multi_match does not return the expected order

Reposted by: 行者123 · Updated: 2023-12-02 22:57:46

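I need to find a phrase in documents, looking at both the title and the content. The title is more important than the content, so I would like to get results in this order:

1. first, documents that match in both title and content
2. then documents that match only in the title
3. then documents that match only in the content

It seems to be quite a basic thing.

So I created an index and data like this: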

PUT /test_index

PUT /test_index/article/3263
{
  "id": 3263,
  "pagetitle": "Lösungen",
  "searchable_content": "abc"
}


PUT /test_index/article/1005
{
  "id": 1005,
  "pagetitle": "Lösungen",
  "searchable_content": "test! Lösungen test?"
}

PUT /test_index/article/677
{
  "id": 677,
  "pagetitle": "Lösungen",
  "searchable_content": "test Lösungen test!"
}

PUT /test_index/article/666
{
  "id": 666,
  "pagetitle": "abc",
  "searchable_content": "test Lösungen test abc"
}

I run a query like this:

GET /test_index/_search
{
    "query": {
        "bool": {
            "must": [{
                    "multi_match": {
                        "query": "Lösungen",
                        "fields": ["pagetitle^2", "searchable_content"]
                    }
                }
            ]
        }
    },
    "highlight": {
        "fields": {
            "pagetitle": {},
            "searchable_content": {}
        }
    }
}

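But the results are not what I expected. I get a document that matches only in the title first, and only then the documents that match in both title and content, as shown here: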

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test_index",
        "_type": "article",
        "_id": "3263",
        "_score": 0.5753642,
        "_source": {
          "id": 3263,
          "pagetitle": "Lösungen",
          "searchable_content": "abc"
        },
        "highlight": {
          "pagetitle": [
            "<em>Lösungen</em>"
          ]
        }
      },
      {
        "_index": "test_index",
        "_type": "article",
        "_id": "1005",
        "_score": 0.36464313,
        "_source": {
          "id": 1005,
          "pagetitle": "Lösungen",
          "searchable_content": "test! Lösungen test?"
        },
        "highlight": {
          "searchable_content": [
            "test! <em>Lösungen</em> test?"
          ],
          "pagetitle": [
            "<em>Lösungen</em>"
          ]
        }
      },
      {
        "_index": "test_index",
        "_type": "article",
        "_id": "677",
        "_score": 0.36464313,
        "_source": {
          "id": 677,
          "pagetitle": "Lösungen",
          "searchable_content": "test Lösungen test!"
        },
        "highlight": {
          "searchable_content": [
            "test <em>Lösungen</em> test!"
          ],
          "pagetitle": [
            "<em>Lösungen</em>"
          ]
        }
      },
      {
        "_index": "test_index",
        "_type": "article",
        "_id": "666",
        "_score": 0.2876821,
        "_source": {
          "id": 666,
          "pagetitle": "abc",
          "searchable_content": "test Lösungen test abc"
        },
        "highlight": {
          "searchable_content": [
            "test <em>Lösungen</em> test abc"
          ]
        }
      }
    ]
  }
}
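The scores in this response can be reproduced by hand, which helps explain the order. The sketch below is a minimal reconstruction, assuming the default BM25 similarity (k1 = 1.2, b = 0.75) and assuming a shard placement that is not stated in the question but is consistent with the numbers: document 3263 alone on one shard, documents 1005 and 677 together on another, document 666 alone on a third. The function name `bm25` and its parameters are illustrative, not an Elasticsearch API.

```python
import math

K1, B = 1.2, 0.75  # BM25 defaults (assumed to be unchanged on this index)

def bm25(tf, doc_len, avg_len, doc_count, doc_freq, boost=1.0):
    """BM25 score of one term in one field, using per-shard statistics."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_part = tf * (K1 + 1) / (tf + K1 * (1 - B + B * doc_len / avg_len))
    return boost * idf * tf_part

# Assumption: doc 3263 is the only document on its shard (pagetitle^2).
title_3263 = bm25(tf=1, doc_len=1, avg_len=1, doc_count=1, doc_freq=1, boost=2)

# Assumption: docs 1005 and 677 share a shard, so the term is more common there.
title_1005 = bm25(tf=1, doc_len=1, avg_len=1, doc_count=2, doc_freq=2, boost=2)
content_1005 = bm25(tf=1, doc_len=3, avg_len=3, doc_count=2, doc_freq=2)

# Assumption: doc 666 is alone on its shard and matches only in the content.
content_666 = bm25(tf=1, doc_len=4, avg_len=4, doc_count=1, doc_freq=1)

# The default multi_match type (best_fields) keeps only the best field score.
score_3263 = title_3263
score_1005 = max(title_1005, content_1005)
score_666 = content_666

print(round(score_3263, 6), round(score_1005, 6), round(score_666, 6))
```

Under these assumptions two things are visible: the content match of 1005 and 677 adds nothing because only the best field counts, and their title score is lower than that of 3263 purely because the IDF is computed per shard.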

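What I tried next was to play a bit more with field boosting. It seems that for the case above it is possible to set a boost on both fields and use a type such as most_fields: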

GET /test_index/_search
{
    "query": {
        "bool": {
            "must": [{
                    "multi_match": {
                        "query": "Lösungen",
                        "fields": ["pagetitle^3", "searchable_content^2"],
                        "type": "most_fields"                       
                    }
                }
            ]
        }
    },
    "highlight": {
        "fields": {
            "pagetitle": {},
            "searchable_content": {}
        }
    }
}

This gives the expected results for this set of data.
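The difference between the two multi_match types comes down to how per-field scores are combined. The following is a rough sketch, not the Elasticsearch implementation: `combine` is a hypothetical helper, and the field scores 0.6 and 0.4 are made-up values chosen only to show the effect.

```python
def combine(field_scores, match_type="best_fields", tie_breaker=0.0):
    """Combine per-field scores roughly the way the two multi_match types do."""
    matched = [s for s in field_scores if s > 0]
    if not matched:
        return 0.0
    if match_type == "best_fields":
        # dis_max behaviour: best score, plus tie_breaker times the others
        best = max(matched)
        return best + tie_breaker * (sum(matched) - best)
    if match_type == "most_fields":
        # bool/should behaviour: scores of all matching fields are added
        return sum(matched)
    raise ValueError(f"unsupported type: {match_type}")

# Hypothetical field scores: (pagetitle, searchable_content)
both_fields = (0.6, 0.4)
title_only = (0.6, 0.0)

print(combine(both_fields, "best_fields"), combine(title_only, "best_fields"))
print(combine(both_fields, "most_fields"), combine(title_only, "most_fields"))
```

With best_fields and the default tie_breaker of 0, a match in both fields ties with a title-only match; with most_fields the extra field adds to the score. Both still depend on the underlying relevance scores, so neither guarantees a fixed order between the three groups.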

But if I add 2 extra records:

PUT /test_index/article/999
{
  "id": 999,
  "pagetitle": "abc",
  "searchable_content": "test Lösungen test abc double match Lösungen"
}


PUT /test_index/article/1006
{
  "id": 1006,
  "pagetitle": "Lösungen and Lösungen",
  "searchable_content": "test sample"
}

it no longer works, because the results now look like this:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 2.2315955,
    "hits": [
      {
        "_index": "test_index",
        "_type": "article",
        "_id": "1006",
        "_score": 2.2315955,
        "_source": {
          "id": 1006,
          "pagetitle": "Lösungen and Lösungen",
          "searchable_content": "test sample"
        },
        "highlight": {
          "pagetitle": [
            "<em>Lösungen</em> and <em>Lösungen</em>"
          ]
        }
      },
      {
        "_index": "test_index",
        "_type": "article",
        "_id": "666",
        "_score": 1.219939,
        "_source": {
          "id": 666,
          "pagetitle": "abc",
          "searchable_content": "test Lösungen test abc"
        },
        "highlight": {
          "searchable_content": [
            "test <em>Lösungen</em> test abc"
          ]
        }
      },
      {
        "_index": "test_index",
        "_type": "article",
        "_id": "1005",
        "_score": 0.86785066,
        "_source": {
          "id": 1005,
          "pagetitle": "Lösungen",
          "searchable_content": "test! Lösungen test?"
        },
        "highlight": {
          "searchable_content": [
            "test! <em>Lösungen</em> test?"
          ],
          "pagetitle": [
            "<em>Lösungen</em>"
          ]
        }
      },
      {
        "_index": "test_index",
        "_type": "article",
        "_id": "677",
        "_score": 0.86785066,
        "_source": {
          "id": 677,
          "pagetitle": "Lösungen",
          "searchable_content": "test Lösungen test!"
        },
        "highlight": {
          "searchable_content": [
            "test <em>Lösungen</em> test!"
          ],
          "pagetitle": [
            "<em>Lösungen</em>"
          ]
        }
      },
      {
        "_index": "test_index",
        "_type": "article",
        "_id": "3263",
        "_score": 0.8630463,
        "_source": {
          "id": 3263,
          "pagetitle": "Lösungen",
          "searchable_content": "abc"
        },
        "highlight": {
          "pagetitle": [
            "<em>Lösungen</em>"
          ]
        }
      },
      {
        "_index": "test_index",
        "_type": "article",
        "_id": "999",
        "_score": 0.7876096,
        "_source": {
          "id": 999,
          "pagetitle": "abc",
          "searchable_content": "test Lösungen test abc double match Lösungen"
        },
        "highlight": {
          "searchable_content": [
            "test <em>Lösungen</em> test abc double match <em>Lösungen</em>"
          ]
        }
      }
    ]
  }
}

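So, as you can see, a document that matches only in the content (666) now ranks above documents that match in both title and content (1005 and 677).

Could you explain what I am doing wrong and how to fix it?

Best answer

Try constant score like this: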

GET test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "query": {
              "match": {
                "pagetitle": {
                  "query": "Lösungen"
                }
              }
            },
            "boost": 2
          }
        },
        {
          "constant_score": {
            "query": {
              "match": {
                "searchable_content": "Lösungen"
              }
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "pagetitle": {},
      "searchable_content": {}
    }
  }
}

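According to the documentation, constant_score is "a query that wraps another query and simply returns a constant score equal to the query boost for every document in the filter."
The link by @davide will help you understand why documents can score higher even with a match only on searchable_content. Since you want to ignore term frequency and IDF across fields, you can use a constant score on the matches for each field.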
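To see why constant scores produce the three groups from the question, here is a small simulation over the six sample documents. It is only a sketch: `tokens` is a crude stand-in for the analyzer, and `constant_score` is a hypothetical function that mimics the two should clauses (boost 2 for a title match, the default boost 1 for a content match).

```python
import re

# (pagetitle, searchable_content) for the sample documents from the question
DOCS = {
    3263: ("Lösungen", "abc"),
    1005: ("Lösungen", "test! Lösungen test?"),
    677: ("Lösungen", "test Lösungen test!"),
    666: ("abc", "test Lösungen test abc"),
    999: ("abc", "test Lösungen test abc double match Lösungen"),
    1006: ("Lösungen and Lösungen", "test sample"),
}

def tokens(text):
    # Crude stand-in for the analyzer: lowercase word tokens.
    return re.findall(r"\w+", text.lower())

def constant_score(doc, term, title_boost=2.0, content_boost=1.0):
    """Sum of fixed boosts, one per field that contains the term."""
    title, content = doc
    term = term.lower()
    score = 0.0
    if term in tokens(title):
        score += title_boost
    if term in tokens(content):
        score += content_boost
    return score

scores = {doc_id: constant_score(doc, "Lösungen") for doc_id, doc in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Every document lands on exactly one of three score levels (3, 2 or 1), independent of shard placement, field length, or how often the term occurs. The order inside a level is arbitrary.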
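Edit:
The query above works for the rules listed in the original question. However, based on the OP's comments, we also need to rank results by how often the search term occurs. So term frequency and inverse document frequency clearly matter, but perhaps we care less about field length here (if we only want to rank results by the number of occurrences). In that case, I suggest setting up your index like this: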

POST test_index_v1
{
  "mappings": {
      "article": {
        "properties": {
          "id": {
            "type": "long"
          },
          "pagetitle": {
            "type": "string",
            "norms": {
              "enabled": false
            }
          },
          "searchable_content": {
            "type": "string",
            "norms": {
              "enabled": false
            }
          }
        }
      }
   }
}

Note: in version 5 and later, type: string is replaced by type: text.
The link mentioned by @davide describes what disabling norms does.
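The effect of disabling norms can be illustrated with the term-frequency part of BM25. This is a sketch under assumptions: default parameters k1 = 1.2 and b = 0.75, and a field without norms treated as if it had average length. The helper `tf_factor` and the field lengths used below are hypothetical.

```python
K1, B = 1.2, 0.75  # BM25 defaults, assumed here

def tf_factor(tf, doc_len=None, avg_len=None):
    """Term-frequency part of BM25; pass no lengths to mimic disabled norms."""
    if doc_len is None or avg_len is None:
        length_ratio = 1.0  # no norms: every field is treated as average length
    else:
        length_ratio = doc_len / avg_len
    return tf * (K1 + 1) / (tf + K1 * (1 - B + B * length_ratio))

# With norms: two occurrences in a long field (8 tokens, average 3)
# can score below one occurrence in a short field (3 tokens).
long_two_hits = tf_factor(2, doc_len=8, avg_len=3)
short_one_hit = tf_factor(1, doc_len=3, avg_len=3)

# Without norms: only the number of occurrences matters.
two_hits = tf_factor(2)
one_hit = tf_factor(1)

print(long_two_hits, short_one_hit, two_hits, one_hit)
```

In this sketch the length penalty outweighs the second occurrence when norms are enabled, which is the opposite of ranking by number of occurrences. Without norms, more occurrences always give a higher factor.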
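Second, since you are running the query on a small number of documents, and assuming the index has been allocated several shards, it is better to run the query with search_type=dfs_query_then_fetch, because the local IDF on each shard can differ a lot.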
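A small calculation shows how much local IDF can vary on a tiny index. The shard placement below is hypothetical (it is one layout consistent with the scores in the question), and `idf` uses the BM25 formula with the document counts of a single shard or of the whole index.

```python
import math

def idf(doc_count, doc_freq):
    """BM25 inverse document frequency for one term in one field."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical shard placement of the first four documents.
# Value is 1 if pagetitle contains the search term, else 0.
shards = {
    "shard_a": {3263: 1},
    "shard_b": {1005: 1, 677: 1},
    "shard_c": {666: 0},
}

# Default search type: each shard scores with its own statistics.
local_idf = {
    name: idf(len(docs), sum(docs.values())) for name, docs in shards.items()
}

# dfs_query_then_fetch: statistics are collected from all shards first,
# so every shard scores with the same index-wide value.
all_docs = {doc_id: hit for docs in shards.values() for doc_id, hit in docs.items()}
global_idf = idf(len(all_docs), sum(all_docs.values()))

print(local_idf, global_idf)
```

With local statistics the same title match is worth about 0.29 on one shard and about 0.18 on another; with index-wide statistics all title matches share one IDF value, so identical fields get identical scores.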
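Third, what we want to add to the previous query is a weight that takes only TF-IDF into account. The previous query ranked documents exactly the same whether the search term occurred 2 or 3 times in the same field.
We can add a bool-should block whose score is added to the score of the constant_score blocks, like this: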

GET test_index_v1/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "query": {
              "match": {
                "pagetitle": {
                  "query": "Lösungen"
                }
              }
            },
            "boost": 2
          }
        },
        {
          "constant_score": {
            "query": {
              "match": {
                "searchable_content": "Lösungen"
              }
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "match": {
                  "pagetitle": {
                    "query": "Lösungen",
                    "boost": 2
                  }
                }
              },
              {
                "match": {
                  "searchable_content": "Lösungen"
                }
              }
            ]
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "pagetitle": {},
      "searchable_content": {}
    }
  }
}
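How the three should clauses add up can be sketched as follows. This is an illustration under simplifying assumptions, not the real scoring: norms are disabled, a single hypothetical IDF value of 1.0 is used for both fields, and `total_score` is a made-up helper that mirrors the query structure.

```python
K1 = 1.2  # BM25 default, assumed

def tf_factor(tf):
    # Norms disabled: only the number of occurrences matters.
    return tf * (K1 + 1) / (tf + K1) if tf else 0.0

def total_score(title_tf, content_tf, idf=1.0, title_boost=2.0):
    """Constant tier score plus a TF-IDF part, mirroring the should clauses."""
    constant = (title_boost if title_tf else 0.0) + (1.0 if content_tf else 0.0)
    tf_idf = title_boost * idf * tf_factor(title_tf) + idf * tf_factor(content_tf)
    return constant + tf_idf

# (title occurrences, content occurrences) for the six sample documents
DOCS = {
    3263: (1, 0),
    1005: (1, 1),
    677: (1, 1),
    666: (0, 1),
    999: (0, 2),
    1006: (2, 0),
}

scores = {doc_id: total_score(*tfs) for doc_id, tfs in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Under these assumptions the three groups keep their order, and inside each group the document with more occurrences comes first (1006 before 3263, 999 before 666). This is not guaranteed in general: the TF-IDF part is not capped by the one-point gap between groups. With an IDF of 1.0, for example, a title-only document can approach 2 + 2 × 2.2 = 6.4, above the 6.0 of a document with a single match in each field.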

A similar question about elasticsearch - Query using multi_match does not return the expected order can be found on Stack Overflow: https://stackoverflow.com/questions/46213773/
