
elasticsearch - Elasticsearch simple term query gives weird scores

Reposted. Author: 行者123. Updated: 2023-12-02 22:59:07

Elasticsearch version: 5.0.2

I populate my index with:

{_id: 1, tags: ['plop', 'plip', 'plup']},
{_id: 2, tags: ['plop', 'plup']},
{_id: 3, tags: ['plop']},
{_id: 4, tags: ['plap', 'plep']},
{_id: 5, tags: ['plop', 'plip', 'plup']},
{_id: 6, tags: ['plup', 'plip']},
{_id: 7, tags: ['plop', 'plip']}

Then I want to retrieve the most relevant rows for the tags plop and plip:
query: {
bool: {
should: [
{term: {tags: {value:'plop', _name: 'plop'}}},
{term: {tags: {value:'plip', _name: 'plip'}}}
]
}
}

This is equivalent to the following (but I use the former for debugging):
query: {
bool: {
should: [
{terms: {tags: ['plop', 'plip']}}
]
}
}

The scores I get back are really weird:
[
{ id: '2', score: 0.88002616, tags: [ 'plop', 'plup' ] },
{ id: '6', score: 0.88002616, tags: [ 'plup', 'plip' ] },
{ id: '5', score: 0.5063205, tags: [ 'plop', 'plip', 'plup' ] },
{ id: '7', score: 0.3610978, tags: [ 'plop', 'plip' ] },
{ id: '1', score: 0.29277915, tags: [ 'plop', 'plip', 'plup' ] },
{ id: '3', score: 0.2876821, tags: [ 'plop' ] }
]

Here is the detailed response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 0.88002616,
"hits": [
{
"_index": "myindex",
"_type": "mytype",
"_id": "2",
"_score": 0.88002616,
"_source": {
"tags": [
"plop",
"plup"
]
},
"matched_queries": [
"plop"
]
},
{
"_index": "myindex",
"_type": "mytype",
"_id": "6",
"_score": 0.88002616,
"_source": {
"tags": [
"plup",
"plip"
]
},
"matched_queries": [
"plip"
]
},
{
"_index": "myindex",
"_type": "mytype",
"_id": "5",
"_score": 0.5063205,
"_source": {
"tags": [
"plop",
"plip",
"plup"
]
},
"matched_queries": [
"plop",
"plip"
]
},
{
"_index": "myindex",
"_type": "mytype",
"_id": "7",
"_score": 0.3610978,
"_source": {
"tags": [
"plop",
"plip"
]
},
"matched_queries": [
"plop",
"plip"
]
},
{
"_index": "myindex",
"_type": "mytype",
"_id": "1",
"_score": 0.29277915,
"_source": {
"tags": [
"plop",
"plip",
"plup"
]
},
"matched_queries": [
"plop",
"plip"
]
},
{
"_index": "myindex",
"_type": "mytype",
"_id": "3",
"_score": 0.2876821,
"_source": {
"tags": [
"plop"
]
},
"matched_queries": [
"plop"
]
}
]
}
}

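So, two questions:
  • Why do rows that match only one query (ids 2 and 6) score higher than rows that match both (ids 1, 5 and 7)?
  • Why can two rows with identical tags have different scores? (ids 1 and 5)

What am I missing?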

    Best answer

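    OK, I figured out your real problem. By default Elasticsearch stores index data in 5 shards, and when the index holds only a handful of documents this can matter a lot for the _score computation. Some background on shards: https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html

    Why does it matter? For better performance, each shard computes _score over its own data only. Elasticsearch scores with a TF/IDF-style algorithm (BM25 is the default similarity in 5.x), which depends on the total number of documents and on the frequency of the search term IN THE SHARD, not in the whole index (https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html)

    To fix this, you can create the index with a single shard, like this: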

    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      },
      "mappings": {
        "my_type": {
          "properties": {
            "tags": {
              "type": "keyword"
            }
          }
        }
      }
    }
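
To see what a single shard changes, the sketch below scores the seven documents from the question with index-wide statistics. It is an illustration rather than output from a real cluster. It assumes the BM25 IDF formula that Lucene prints in explain output, and it assumes the term-frequency factor is 1 because each tag occurs once per document and keyword fields carry no length norms. The names DOCS, bm25_idf and rank_single_shard are made up for this example.

```python
import math

# Documents from the question: _id -> tags.
DOCS = {
    1: ["plop", "plip", "plup"],
    2: ["plop", "plup"],
    3: ["plop"],
    4: ["plap", "plep"],
    5: ["plop", "plip", "plup"],
    6: ["plup", "plip"],
    7: ["plop", "plip"],
}


def bm25_idf(doc_freq, doc_count):
    """BM25 inverse document frequency for one term."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def rank_single_shard(docs, terms):
    """Rank matching documents using statistics of the whole index.

    Assumption: the term-frequency factor is 1, so every matching
    term contributes exactly its IDF to the document score.
    """
    doc_count = len(docs)
    idf = {}
    for term in terms:
        doc_freq = sum(1 for tags in docs.values() if term in tags)
        idf[term] = bm25_idf(doc_freq, doc_count)

    scores = {}
    for doc_id, tags in docs.items():
        matched = [t for t in terms if t in tags]
        if matched:
            scores[doc_id] = sum(idf[t] for t in matched)
    # Highest score first, ties broken by id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank_single_shard(DOCS, ["plop", "plip"])
for doc_id, score in ranking:
    print(doc_id, round(score, 6))
```

Under these assumptions the documents that match both tags (1, 5 and 7) tie for first place, and documents with identical tags always get identical scores, because every document is scored against the same docFreq and docCount.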

    You can verify my theory by adding ?explain to your search query:

    http://localhost:9200/test1/my_type/_search?explain



    Or, if you want more detail, read through this example ;)
    These are the results I get for your query on ["plop", "plip"]:
    {
    "took": 5,
    "timed_out": false,
    "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
    },
    "hits": {
    "total": 6,
    "max_score": 0.9808292,
    "hits": [
    {
    "_index": "test",
    "_type": "my_type",
    "_id": "2",
    "_score": 0.9808292,
    "_source": {
    "tags": [
    "plop",
    "plup"
    ]
    }
    },
    {
    "_index": "test",
    "_type": "my_type",
    "_id": "6",
    "_score": 0.9808292,
    "_source": {
    "tags": [
    "plup",
    "plip"
    ]
    }
    },
    {
    "_index": "test",
    "_type": "my_type",
    "_id": "5",
    "_score": 0.5753642,
    "_source": {
    "tags": [
    "plop",
    "plip",
    "plup"
    ]
    }
    },
    {
    "_index": "test",
    "_type": "my_type",
    "_id": "1",
    "_score": 0.36464313,
    "_source": {
    "tags": [
    "plop",
    "plip",
    "plup"
    ]
    }
    },
    {
    "_index": "test",
    "_type": "my_type",
    "_id": "7",
    "_score": 0.36464313,
    "_source": {
    "tags": [
    "plop",
    "plip"
    ]
    }
    },
    {
    "_index": "test",
    "_type": "my_type",
    "_id": "3",
    "_score": 0.2876821,
    "_source": {
    "tags": [
    "plop"
    ]
    }
    }
    ]
    }
    }

    Why does the document with plop, plip, plup come third? Check the explain output for it:
       "_shard": "[test][1]",
    "_node": "LjGrgIa7QgiPlEvMxqKOdA",
    "_index": "test",
    "_type": "my_type",
    "_id": "5",
    "_score": 0.5753642,
    "_source": {
    "tags": [
    "plop",
    "plip",
    "plup"
    ]
    },

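    This is the only document in that shard, test[1] (I verified this against the other returned documents)! So the IDF of both terms is computed from docFreq=1 and docCount=1, that is, from this single document rather than from the whole index. The score is built from TF × IDF, and IDF is higher the rarer a term is within the shard, so the same document scores differently depending on which other documents happen to share its shard. See how the 0.5753642 score of this document is computed: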
     "value": 0.2876821,
    "description": "weight(tags:plop...

    "details": [
    {
    "value": 0.2876821,
    "description": "idf(docFreq=1, docCount=1)",


      {
    "value": 0.2876821,
    "description": "weight(tags:plip..

    "value": 0.2876821,
    "description": "idf(docFreq=1, docCount=1)",
    "details": []
    },
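
The numbers in the explain excerpt can be reproduced by hand. The sketch below applies the BM25 IDF formula per shard and sums the IDF of each matching term, again assuming a term-frequency factor of 1. The shard layout in SHARDS is hypothetical: only the fact that document 5 sits alone in its shard comes from the answer, and the rest is inferred from the reported scores, so a real cluster may distribute the documents differently.

```python
import math

# Documents from the question: _id -> tags.
TAGS = {
    1: ["plop", "plip", "plup"],
    2: ["plop", "plup"],
    3: ["plop"],
    4: ["plap", "plep"],
    5: ["plop", "plip", "plup"],
    6: ["plup", "plip"],
    7: ["plop", "plip"],
}

# Hypothetical layout (an assumption): inferred from the scores in the
# answer, with the fifth shard left empty.
SHARDS = [[5], [1, 7], [3], [2, 4, 6]]


def bm25_idf(doc_freq, doc_count):
    """BM25 IDF, matching lines like "idf(docFreq=1, docCount=1)"."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def score_per_shard(shards, tags, terms):
    """Score documents with shard-local docFreq and docCount.

    Assumption: the term-frequency factor is 1, so each matching term
    adds exactly its shard-local IDF to the document score.
    """
    scores = {}
    for shard in shards:
        doc_count = len(shard)
        for term in terms:
            matching = [d for d in shard if term in tags[d]]
            if not matching:
                continue
            idf = bm25_idf(len(matching), doc_count)
            for doc_id in matching:
                scores[doc_id] = scores.get(doc_id, 0.0) + idf
    return scores


scores = score_per_shard(SHARDS, TAGS, ["plop", "plip"])
for doc_id, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(doc_id, round(score, 7))
```

With this layout, document 5 gets ln(4/3) ≈ 0.2876821 for each of its two terms, which adds up to the 0.5753642 shown above. Documents 2 and 6 score higher only because their matching term is rare inside their own three-document shard.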

    Regarding "elasticsearch - Elasticsearch simple term query gives weird scores", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41036961/
