
elasticsearch aggs returns wrong counts


I am trying to run some aggregation queries and I'm running into a problem.

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "group_by": {
      "terms": {
        "field": "category"
      }
    }
  }
}

This is what it returns:

"hits": {
"total": 180,
"max_score": 0,
"hits": []
},
"aggregations": {
"group_by": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 1,
"buckets": [
{
"key": "pf_rd_m",
"doc_count": 139
},
{
"key": "other",
"doc_count": 13
},
{
"key": "_encoding",
"doc_count": 12
},
{
"key": "ie",
"doc_count": 10
},
{
"key": "cadeaux",
"doc_count": 2
},
{
"key": "cartes",
"doc_count": 2
},
{
"key": "cheques",
"doc_count": 2
},
{
"key": "home",
"doc_count": 2
},
{
"key": "nav_logo",
"doc_count": 1
},
{
"key": "ref",
"doc_count": 1
}
]
}

As you can see, it tells me there are 180 documents, but if I sum the doc_count of every key in the buckets (plus sum_other_doc_count), I get more than that...

This is most likely an effect of Elasticsearch's tokenization mechanism (https://www.elastic.co/guide/en/elasticsearch/guide/current/aggregations-and-analysis.html).
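A minimal sketch (plain Python, not Elasticsearch itself, with made-up field values) of why bucket doc_counts on an analyzed field can add up to more than hits.total: the analyzer splits each stored value into tokens, and every distinct token counts the document once, so one document can land in several buckets.

```python
from collections import Counter

# Hypothetical "category" values; each string is one document's field value.
docs = [
    "pf_rd_m ie _encoding",   # one doc, three tokens -> three buckets
    "pf_rd_m home",
    "cadeaux",
]

doc_counts = Counter()
for value in docs:
    for token in set(value.split()):  # crude stand-in for the analyzer
        doc_counts[token] += 1        # the same doc is counted per token

total_docs = len(docs)
sum_of_buckets = sum(doc_counts.values())
print(total_docs)       # 3 documents...
print(sum_of_buckets)   # ...but 6 bucket doc_counts in total
```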

So I tried the solution from that ES article, but it still doesn't work. Here is my mapping:

"properties":{
"status":{
"type":"integer",
"index":"analyzed"
},
"category":{
"type":"string",
"fields": {
"raw" : {
"type": "string",
"index": "not_analyzed"
}
}
},
"dynamic_templates": [
{ "notanalyzed": {
"match": "*",
"match_mapping_type": "string",
"mapping": {
"type": "string",
"index": "not_analyzed"
}
}
}
]
}

As you can see, I have a field called "category" with a "raw" sub-field added as a not_analyzed string, but it still returns the wrong numbers.

When I try this:

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "group_by": {
      "terms": {
        "field": "category.raw"
      }
    }
  }
}

it returns:

"hits": {
"total": 180,
"max_score": 0,
"hits": []
},
"aggregations": {
"group_by": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
}
}

Which is weird. Any help?

Best answer

As stated in the documentation:

the document counts (and the results of any sub aggregations) in the terms aggregation are not always accurate. This is because each shard provides its own view of what the ordered list of terms should be and these are combined to give a final view

To work around this, at the cost of extra resources, you can use the shard_size parameter. Again, from the documentation on shard size:

The higher the requested size is, the more accurate the results will be, but also, the more expensive it will be to compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data transfers between the nodes and the client). The shard_size parameter can be used to minimize the extra work that comes with bigger requested size. When defined, it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the coordinating node will then reduce them to a final result which will be based on the size parameter - this way, one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to the client. If set to 0, the shard_size will be set to Integer.MAX_VALUE.
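As a toy illustration of the reduce phase the documentation describes (plain Python with invented per-shard counts, not Elasticsearch's actual implementation): each shard reports only its local top shard_size terms, so a term that misses a shard's local top list loses that shard's counts when the coordinating node merges the results.

```python
from collections import Counter

# Hypothetical per-shard term frequencies.
shards = [
    Counter({"a": 10, "b": 8, "c": 7}),
    Counter({"b": 9, "c": 8, "a": 2}),
]

def merged_top(shards, shard_size):
    """Merge each shard's local top `shard_size` terms, as a coordinator would."""
    merged = Counter()
    for shard in shards:
        for term, count in shard.most_common(shard_size):
            merged[term] += count
    return merged

# With shard_size=2, shard 2 never reports "a", so its 2 docs are lost:
approx = merged_top(shards, shard_size=2)
exact = merged_top(shards, shard_size=3)  # large enough to see every term
print(approx["a"])  # 10 (undercounted)
print(exact["a"])   # 12 (true count)
```

Raising shard_size widens each shard's local list, which is exactly why it trades extra work for accuracy.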

Try adding the shard_size parameter to your query:

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "group_by": {
      "terms": {
        "field": "category.raw",
        "shard_size": 0
      }
    }
  }
}

Regarding "elasticsearch aggs returns wrong counts", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/35526893/
