
elasticsearch - Getting the total count of all documents in a bucket

Reposted. Author: 行者123. Updated: 2023-12-02 22:36:59

When I search with the following aggregation:

"aggregations": {
"codes": {
"terms": {
"field": "code"
},
"aggs": {
"dates": {
"date_range": {
"field": "created_time",
"ranges": [
{
"from": "2017-12-06T00:00:00.000",
"to": "2017-12-06T16:00:00.000"
},
{
"from": "2017-12-07T00:00:00.000",
"to": "2017-12-07T23:59:59.999"
}
]
}
}
}
}
}

I get the following result:

"aggregations": {
"codes": {
"buckets": [
{
"key": "123456",
"doc_count": 104005499,
"dates": {
"buckets": [
{
"key": "2017-12-05T20:00:00.000Z-2017-12-06T12:00:00.000Z",
"from_as_string": "2017-12-05T20:00:00.000Z",
"to_as_string": "2017-12-06T12:00:00.000Z",
"doc_count": 156643
},
{
"key": "2017-12-06T20:00:00.000Z-2017-12-07T19:59:59.999Z",
"from_as_string": "2017-12-06T20:00:00.000Z",
"to_as_string": "2017-12-07T19:59:59.999Z",
"doc_count": 11874
}
]
}
},
...
]
}
}

So now I have a list of buckets of buckets. For each outer bucket I need a total count, which is the sum of the doc_counts of its inner buckets. For example, the total for my first bucket should be 156643 + 11874 = 168517. I tried a sub-bucket aggregation:

 "totalcount": {
"sum_bucket": {
"buckets_path": "dates"
}
}

but it does not work, failing with "buckets_path must reference either a number value or a single value numeric metric aggregation, got: org.elasticsearch.search.aggregations.bucket.range.date.InternalDateRange.Bucket". Any ideas how I should do this?

Best Answer

It looks like this is a known issue. There is a discussion on the Elastic forums where I found a hack that solves it (credit to its author, Ruslan_Didyk, by the way):

POST my_aggs/my_doc/_search
{
  "size": 0,
  "aggregations": {
    "codes": {
      "terms": {
        "field": "code"
      },
      "aggs": {
        "dates": {
          "date_range": {
            "field": "created_time",
            "ranges": [
              {
                "from": "2017-12-06T00:00:00.000",
                "to": "2017-12-06T16:00:00.000"
              },
              {
                "from": "2017-12-07T00:00:00.000",
                "to": "2017-12-07T23:59:59.999"
              }
            ]
          },
          "aggs": {
            "my_cnt": {
              "value_count": {
                "field": "created_time"
              }
            }
          }
        },
        "totalcount": {
          "stats_bucket": {
            "buckets_path": "dates>my_cnt"
          }
        }
      }
    }
  }
}

The reason you cannot just define totalcount directly is that date_range implicitly creates sub-buckets, and pipeline aggregations cannot handle that (I would call this an Elasticsearch bug).

So the hack is to add another sub-aggregation to dates: my_cnt, which simply counts the documents in each bucket. (Note that I used a value_count aggregation on the created_time field, assuming it is present in all documents and has exactly one value.)

Given a set of documents like this:

{"code":"1234","created_time":"2017-12-06T01:00:00"}
{"code":"1234","created_time":"2017-12-06T17:00:00"}
{"code":"1234","created_time":"2017-12-07T01:00:00"}
{"code":"1234","created_time":"2017-12-06T02:00:00"}
{"code":"1235","created_time":"2017-12-07T18:00:00"}
{"code":"1234","created_time":"2017-12-07T18:00:00"}

the result of the aggregation will be:

  "aggregations": {
"codes": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1234",
"doc_count": 5,
"dates": {
"buckets": [
{
"key": "2017-12-06T00:00:00.000Z-2017-12-06T16:00:00.000Z",
"from": 1512518400000,
"from_as_string": "2017-12-06T00:00:00.000Z",
"to": 1512576000000,
"to_as_string": "2017-12-06T16:00:00.000Z",
"doc_count": 2,
"my_cnt": {
"value": 2
}
},
{
"key": "2017-12-07T00:00:00.000Z-2017-12-07T23:59:59.999Z",
"from": 1512604800000,
"from_as_string": "2017-12-07T00:00:00.000Z",
"to": 1512691199999,
"to_as_string": "2017-12-07T23:59:59.999Z",
"doc_count": 2,
"my_cnt": {
"value": 2
}
}
]
},
"totalcount": {
"count": 2,
"min": 2,
"max": 2,
"avg": 2,
"sum": 4
}
},
{
"key": "1235",
"doc_count": 1,
"dates": {
"buckets": [
{
"key": "2017-12-06T00:00:00.000Z-2017-12-06T16:00:00.000Z",
"from": 1512518400000,
"from_as_string": "2017-12-06T00:00:00.000Z",
"to": 1512576000000,
"to_as_string": "2017-12-06T16:00:00.000Z",
"doc_count": 0,
"my_cnt": {
"value": 0
}
},
{
"key": "2017-12-07T00:00:00.000Z-2017-12-07T23:59:59.999Z",
"from": 1512604800000,
"from_as_string": "2017-12-07T00:00:00.000Z",
"to": 1512691199999,
"to_as_string": "2017-12-07T23:59:59.999Z",
"doc_count": 1,
"my_cnt": {
"value": 1
}
}
]
},
"totalcount": {
"count": 1,
"min": 1,
"max": 1,
"avg": 1,
"sum": 1
}
}
]
}
}

The desired value is under totalcount.sum.
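If only the sum is needed, the stats_bucket in the query above should be replaceable by a sum_bucket with the same buckets_path; since the path now ends in my_cnt, a single-value numeric metric, the original error no longer applies. A minimal, untested sketch:

```json
"totalcount": {
  "sum_bucket": {
    "buckets_path": "dates>my_cnt"
  }
}
```

This returns just a single "value" per outer bucket instead of the full stats object.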

A few notes

As stated above, this only works as long as the assumption holds that created_time is always present with exactly one value. If, in a different setup, the field under the date_range aggregation can have multiple values (e.g. an update_time holding all updates of a document), the sum will no longer equal the actual number of matching documents when those dates fall into overlapping ranges.

In that case, you can always use a filter aggregation with range queries inside it.
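A rough, untested sketch of that alternative, using Elasticsearch's filters aggregation with one range query per period (the bucket names day1/day2 are arbitrary). Each bucket's own doc_count then counts every matching document exactly once, however many values the date field holds, so sum_bucket over "dates>_count" should work:

```json
"dates": {
  "filters": {
    "filters": {
      "day1": {
        "range": {
          "created_time": {
            "gte": "2017-12-06T00:00:00.000",
            "lt": "2017-12-06T16:00:00.000"
          }
        }
      },
      "day2": {
        "range": {
          "created_time": {
            "gte": "2017-12-07T00:00:00.000",
            "lt": "2017-12-07T23:59:59.999"
          }
        }
      }
    }
  }
}
```

Note that "lt" is used to mirror date_range's exclusive "to" bound.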

Hope this helps!

Regarding elasticsearch - getting the total count of all documents in a bucket, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47714229/
