
elasticsearch - How can a match_phrase-style query take word order into account without requiring all searched words to be present in the document?

Reposted. Author: 行者123. Last updated: 2023-12-03 02:20:39

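Suppose my index has two documents:

  • "get my money"
  • "my money get here"

    When I run a regular match query for "get my money", both documents match correctly, but they receive equal scores. I want word order to matter for scoring. In other words, I want "get my money" to score higher.

    So I put the match query into the must clause of a bool query and added a match_phrase query (with the same query string) to the should clause. This seemed to score the hits correctly until I searched for "how do I get my money". In that case the match_phrase query does not seem to match at all, and the hits come back with equal scores again.

    How should I structure my index/query so that it takes word order into account but does not require all searched words to be present in the document?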

    Index mapping with test data


    PUT test-index
    {
      "mappings": {
        "properties": {
          "keyword": {
            "type": "text",
            "similarity": "boolean"
          }
        }
      }
    }

    POST test-index/_doc/
    {
      "keyword": "get my money"
    }

    POST test-index/_doc/
    {
      "keyword": "my money get here"
    }

    Query "How do I get my money" - Doesn't work as needed


    GET /test-index/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "keyword": "how do i get my money"
              }
            }
          ],
          "should": [
            {
              "match_phrase": {
                "keyword": {
                  "query": "how do i get my money"
                }
              }
            }
          ]
        }
      }
    }

    Results (Both documents scored same)


    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 3.0,
        "hits" : [
          {
            "_index" : "test-index",
            "_type" : "_doc",
            "_id" : "6Xy8wXIB3NtI_ttPGBoV",
            "_score" : 3.0,
            "_source" : {
              "keyword" : "get my money"
            }
          },
          {
            "_index" : "test-index",
            "_type" : "_doc",
            "_id" : "6ny8wXIB3NtI_ttPGBpV",
            "_score" : 3.0,
            "_source" : {
              "keyword" : "my money get here"
            }
          }
        ]
      }
    }
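
    To see why the two hits tie, the same request can be sent with "explain": true, a standard search option that returns a per-hit score breakdown. With the boolean similarity, each matching term of the match query contributes its boost (1.0 by default), so "get", "my" and "money" add up to 3.0 in both documents; the match_phrase clause adds nothing because neither document contains the whole phrase. A minimal sketch against the test-index defined above:

    ```
    # Same bool query as above, with a score explanation attached to each hit.
    GET /test-index/_search
    {
      "explain": true,
      "query": {
        "bool": {
          "must": [
            { "match": { "keyword": "how do i get my money" } }
          ],
          "should": [
            { "match_phrase": { "keyword": { "query": "how do i get my money" } } }
          ]
        }
      }
    }
    ```

    Each hit should then carry an _explanation object listing the per-term contributions, which makes it easy to confirm that the should clause did not fire.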

    Edit 1

    As @gibbs suggested, let's remove "similarity": "boolean". Below is a simplified, more focused version of the problem. We are still looking for an answer.

    Removed "similarity": "boolean"


    PUT test-index
    {
      "mappings": {
        "properties": {
          "keyword": {
            "type": "text"
          }
        }
      }
    }

    POST test-index/_doc/
    {
      "keyword": "get my money"
    }

    POST test-index/_doc/
    {
      "keyword": "my money get here"
    }

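    How can I make this query return results? Right now it returns nothing. With match_phrase, is it possible to get results when not all of the searched words are present in the document?

    GET /test-index/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "match_phrase": {
                "keyword": {
                  "query": "how do I get my money"
                }
              }
            }
          ]
        }
      }
    }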
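
    A match_phrase query only matches when every analyzed query term appears in the field at consecutive positions, so a single missing term such as "how" rules a document out. One way to check which terms a query string produces is the _analyze API. The request below is a small sketch that assumes the test-index mapping from this edit, where the field uses the default standard analyzer:

    ```
    # Show the tokens that the "keyword" field's analyzer produces for the query string.
    GET /test-index/_analyze
    {
      "field": "keyword",
      "text": "how do I get my money"
    }
    ```

    With the standard analyzer this should list six lowercased tokens (how, do, i, get, my, money). Neither test document contains the first three, which is why the phrase query returns no hits.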

    Edit 2

    In our use case we cannot use BM25 (TF/IDF), because it messes up our results.

    POST test-index/_doc
    {
      "keyword": "get my money, claim, distribution, getting started"
    }

    POST test-index/_doc
    {
      "keyword": "my money get here"
    }

    GET /test-index/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "keyword": "how do I get my money"
              }
            }
          ]
        }
      }
    }

    Results


    {
      "took" : 16,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 0.6156533,
        "hits" : [
          {
            "_index" : "test-index",
            "_type" : "_doc",
            "_id" : "JnxCw3IB3NtI_ttPBjQv",
            "_score" : 0.6156533,
            "_source" : {
              "keyword" : "my money get here"
            }
          },
          {
            "_index" : "test-index",
            "_type" : "_doc",
            "_id" : "x3xSw3IB3NtI_ttP1DUi",
            "_score" : 0.49206492,
            "_source" : {
              "keyword" : "get my money, claim, distribution, getting started"
            }
          }
        ]
      }
    }

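    In this case "my money get here" scores much higher than expected. So we cannot use a similarity whose score depends on the number of matching documents, the field length, and so on.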
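
    The field-length effect described here comes from BM25's length normalization, which is controlled by its b parameter (0.75 by default; 0 disables the normalization). The following is only a sketch of how that one factor could be switched off with a custom similarity. The index name test-index-b0 and the similarity name bm25_no_length_norm are made up for illustration, and the IDF part of BM25 still applies, so this does not remove the dependence on how many documents match.

    ```
    # Hypothetical index: BM25 with length normalization disabled (b = 0).
    PUT test-index-b0
    {
      "settings": {
        "index": {
          "similarity": {
            "bm25_no_length_norm": {
              "type": "BM25",
              "b": 0
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "keyword": {
            "type": "text",
            "similarity": "bm25_no_length_norm"
          }
        }
      }
    }
    ```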

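    Sorry for the long question. So, back to my original question: how should I structure my index/query so that it takes word order into account but does not require all searched words to be present in the document?

    Best answer

    The problem is caused by your similarity parameter.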

    A simple boolean similarity, which is used when full-text ranking is not needed and the score should only be based on whether the query terms match or not. Boolean similarity gives terms a score equal to their query boost



    Reference

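    You should use a different similarity (BM25) to get better scores.

    I removed the similarity parameter from the mapping and indexed the same data.
    The index then used the default similarity.

    The scores are as follows.
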
    {
      "took": 1069,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": {
          "value": 2,
          "relation": "eq"
        },
        "max_score": 0.5809142,
        "hits": [
          {
            "_index": "test-index",
            "_type": "_doc",
            "_id": "WpaHwnIBa8oXh9OgX4Hb",
            "_score": 0.5809142,
            "_source": {
              "keyword": "get my money"
            }
          },
          {
            "_index": "test-index",
            "_type": "_doc",
            "_id": "W5aHwnIBa8oXh9OgeYG9",
            "_score": 0.5167642,
            "_source": {
              "keyword": "my money get here"
            }
          }
        ]
      }
    }
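
    Separately from the answer above, one technique sometimes used when word order should influence the score without every query word being required is to index word pairs (shingles) in a sub-field and query that sub-field with an ordinary match query. The sketch below is an assumption-laden illustration, not something taken from the question or the answer: the index name test-index-shingles, the filter name bigram_filter, the analyzer name bigram_analyzer and the sub-field name bigrams are all hypothetical.

    ```
    # Hypothetical index: the "keyword" field plus a sub-field holding two-word shingles.
    PUT test-index-shingles
    {
      "settings": {
        "analysis": {
          "filter": {
            "bigram_filter": {
              "type": "shingle",
              "min_shingle_size": 2,
              "max_shingle_size": 2,
              "output_unigrams": false
            }
          },
          "analyzer": {
            "bigram_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase", "bigram_filter"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "keyword": {
            "type": "text",
            "similarity": "boolean",
            "fields": {
              "bigrams": {
                "type": "text",
                "analyzer": "bigram_analyzer",
                "similarity": "boolean"
              }
            }
          }
        }
      }
    }

    # The must clause matches on single words; the should clause rewards shared word pairs.
    GET /test-index-shingles/_search
    {
      "query": {
        "bool": {
          "must": [
            { "match": { "keyword": "how do i get my money" } }
          ],
          "should": [
            { "match": { "keyword.bigrams": "how do i get my money" } }
          ]
        }
      }
    }
    ```

    After indexing the same test documents into this index, "get my money" shares two word pairs with the query ("get my", "my money") while "my money get here" shares only one ("my money"), so the first document should rank higher even though neither contains "how do i".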

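    Regarding "elasticsearch - How can a match_phrase-style query take word order into account without requiring all searched words to be present in the document?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/62427934/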
