
elasticsearch - How to make shorter (closer) token matches more relevant? (edge_ngram)

Reposted. Author: 行者123. Updated: 2023-12-03 00:44:26

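I'm getting strange results from the edge_ngram tokenizer that I use for autocomplete, and I'm trying to figure out how to make my results more relevant. I copied the example from the Elasticsearch documentation.
I have documents with the following descriptions: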

  • "Apples, raw, without skin"
  • "Apples, raw, golden delicious, with skin"
  • "APPLEBEE'S, chili"
  • "Babyfood, fruit, applesauce, junior"

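  • If I search for apple, "APPLEBEE'S, chili" scores higher than "Apples, raw, without skin".
  • If I search for apples, "Babyfood, fruit, applesauce, junior" scores higher than "Apples, raw, golden delicious, with skin".

In both cases I would expect the closer/shorter match to be more relevant and to score higher (that is, when I search for apple or apples, results containing the word apples should score higher than APPLEBEE'S or applesauce).
My settings are: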
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "autocomplete": {
              "tokenizer": "autocomplete",
              "filter": [
                "lowercase",
                "asciifolding"
              ]
            },
            "autocomplete_search": {
              "tokenizer": "lowercase"
            }
          },
          "tokenizer": {
            "autocomplete": {
              "type": "edge_ngram",
              "min_gram": 2,
              "max_gram": 20,
              "token_chars": [
                "letter"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "description": {
            "type": "text",
            "analyzer": "autocomplete",
            "search_analyzer": "autocomplete_search"
          }
        }
      }
    }
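To see what this tokenizer emits for the sample descriptions, here is a rough Python imitation of an edge_ngram tokenizer with token_chars set to letter, followed by lowercasing. It is only a sketch under simplifying assumptions (the asciifolding filter is left out, and the function name edge_ngrams is made up for illustration), not Elasticsearch's actual implementation.

```python
import re

def edge_ngrams(text, min_gram=2, max_gram=20):
    """Approximate edge_ngram tokenization with token_chars=["letter"] plus lowercase.

    Hypothetical helper for illustration; asciifolding is not modelled.
    """
    tokens = []
    # Anything that is not a letter acts as a separator.
    for word in re.findall(r"[^\W\d_]+", text):
        word = word.lower()
        # Emit every prefix from min_gram up to max_gram (or the word length).
        for n in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append(word[:n])
    return tokens

docs = [
    "Apples, raw, without skin",
    "Apples, raw, golden delicious, with skin",
    "APPLEBEE'S, chili",
    "Babyfood, fruit, applesauce, junior",
]

for doc in docs:
    tokens = edge_ngrams(doc)
    print(f"{doc!r}: {len(tokens)} tokens, has 'apple': {'apple' in tokens}")
```

Under these assumptions every description yields the token apple, so a search for apple matches all four documents equally well at the term level, and the ranking is left to other scoring factors.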
Query:
    {
      "query": {
        "match": {
          "description": {
            "query": "apple",
            "operator": "and"
          }
        }
      }
    }
How can I make the more relevant results score higher?

Best answer

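This happens because of the length of the matched field, called dl, in the newer BM25 scoring algorithm. You can easily see the details by adding the explain parameter to your query: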

    http://{{hostname}}:{{port}}//_search?explain=true


Because "APPLEBEE'S, chili" has the shortest field length, it scores higher. Here is the tf score for this document:
    {
      "value": 0.5344296,
      "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
      "details": [
        {
          "value": 1.0,
          "description": "freq, occurrences of term within document",
          "details": []
        },
        {
          "value": 1.2,
          "description": "k1, term saturation parameter",
          "details": []
        },
        {
          "value": 0.75,
          "description": "b, length normalization parameter",
          "details": []
        },
        {
          "value": 11.0,
          "description": "dl, length of field", ---> note this
          "details": []
        },
        {
          "value": 17.333334,
          "description": "avgdl, average length of field",
          "details": []
        }
      ]
    }
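The tf value in this explain output can be reproduced by plugging the listed numbers into the formula from its description line. The sketch below does that in Python; the function name bm25_tf and the two longer dl values (16 and 25) are illustrative assumptions, chosen only to show how a longer field lowers the tf component.

```python
def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf component as written in the explain output:
    freq / (freq + k1 * (1 - b + b * dl / avgdl))
    """
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

avgdl = 17.333334  # average field length from the explain output

# dl = 11.0 is the value from the explain output; 16.0 and 25.0 are
# hypothetical longer fields used for comparison.
for dl in (11.0, 16.0, 25.0):
    print(f"dl={dl:>4}: tf={bm25_tf(1.0, dl, avgdl):.7f}")
```

With freq, k1, b and avgdl held fixed, tf shrinks as dl grows, which is why the shortest field wins when every document contains the term once.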
Solution
You need to create another field that uses the english analyzer, as shown in the multi-fields example. Here is a complete example.
Index definition:
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "autocomplete": {
              "tokenizer": "autocomplete",
              "filter": [
                "lowercase",
                "asciifolding"
              ]
            },
            "autocomplete_search": {
              "tokenizer": "lowercase"
            }
          },
          "tokenizer": {
            "autocomplete": {
              "type": "edge_ngram",
              "min_gram": 2,
              "max_gram": 20,
              "token_chars": [
                "letter"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "name": {
            "type": "text",
            "analyzer": "autocomplete",
            "search_analyzer": "autocomplete_search",
            "fields": {
              "english": {
                "type": "text",
                "analyzer": "english"
              }
            }
          }
        }
      }
    }
Index your sample documents:
    {
      "name": "Apples, raw, without skin"
    }
    {
      "name": "APPLEBEE'S, chili"
    }
    {
      "name": "Babyfood, fruit, applesauce, junior"
    }
    {
      "name": "Apples, raw, golden delicious, with skin"
    }
Search query:
    {
      "query": {
        "bool": {
          "should": [
            {
              "multi_match": {
                "query": "apple",
                "fields": [
                  "name.english",
                  "name"
                ]
              }
            }
          ]
        }
      }
    }
And the search results. Note that the documents containing the word apples score higher:
    "hits": [
      {
        "_index": "edgelow",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.6747451,
        "_source": {
          "name": "Apples, raw, without skin"
        }
      },
      {
        "_index": "edgelow",
        "_type": "_doc",
        "_id": "4",
        "_score": 0.60996956,
        "_source": {
          "name": "Apples, raw, golden delicious, with skin"
        }
      },
      {
        "_index": "edgelow",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.12822598,
        "_source": {
          "name": "APPLEBEE'S, chili"
        }
      },
      {
        "_index": "edgelow",
        "_type": "_doc",
        "_id": "3",
        "_score": 0.09446116,
        "_source": {
          "name": "Babyfood, fruit, applesauce, junior"
        }
      }
    ]
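One way to read this ranking: the edge n-gram field name matches all four documents on the prefix apple, while name.english only matches documents that contain apple as a whole (stemmed) word. The toy Python model below illustrates that split. It is a sketch under loud assumptions: toy_stem is a crude stand-in that only strips a possessive 's and a trailing s, not the real stemmer behind the english analyzer, and all function names are hypothetical.

```python
import re

def prefix_tokens(text, min_gram=2, max_gram=20):
    """Approximate tokens of the edge n-gram field (letters only, lowercased)."""
    tokens = set()
    for word in re.findall(r"[^\W\d_]+", text):
        word = word.lower()
        for n in range(min_gram, min(len(word), max_gram) + 1):
            tokens.add(word[:n])
    return tokens

def toy_stem(token):
    """Crude stand-in for english stemming: drop possessive 's, then a trailing s."""
    token = token.lower()
    if token.endswith("'s"):
        token = token[:-2]
    if token.endswith("s"):
        token = token[:-1]
    return token

def whole_word_tokens(text):
    """Approximate tokens of the name.english sub-field using the toy stemmer."""
    return {toy_stem(t) for t in re.findall(r"[A-Za-z']+", text)}

def matching_fields(query, name):
    """Return which of the two fields would match the single-word query."""
    fields = []
    if query.lower() in prefix_tokens(name):
        fields.append("name")
    if toy_stem(query) in whole_word_tokens(name):
        fields.append("name.english")
    return fields

names = [
    "Apples, raw, without skin",
    "APPLEBEE'S, chili",
    "Babyfood, fruit, applesauce, junior",
    "Apples, raw, golden delicious, with skin",
]

for name in names:
    print(f"{name!r}: {matching_fields('apple', name)}")
```

In this model only the two Apples documents match on name.english. Since multi_match defaults to the best_fields type, which scores a document by its best-matching field, that whole-word match is what can lift them above the prefix-only matches.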

Source: a similar question on Stack Overflow, "How to make shorter (closer) token matches more relevant? (edge_ngram)": https://stackoverflow.com/questions/64530450/
