
elasticsearch - Nested queries in Elasticsearch?

Reposted. Author: 行者123. Updated: 2023-12-03 00:18:22

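My team owns several dashboards and is exploring a migration to Elasticsearch as a way to consolidate our software stack. One common chart type we expose answers the question "Which workflows are in progress at the end of each day?" Here is some sample data: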

day workflow_id version status
20151101 1 1 In Progress
20151101 2 1 In Progress
20151102 1 2 In Progress
20151102 3 1 In Progress
20151102 4 1 In Progress
20151102 2 2 Completed
20151103 1 3 Completed
20151103 3 2 In Progress
20151104 3 3 Completed
20151105 4 2 Completed

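Every time something changes in a workflow, a new record is inserted; the change may or may not affect the status. The record with max(version) holds the latest data for a given workflow ID.

The goal is a chart showing the total number of "In Progress" and "Completed" workflows at the end of each day. For each workflow, only the record with the highest version number up to and including that day should be counted. In SQL this can be done with a nested query: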
with 

snapshot_dates as
(select distinct day from workflow),

snapshot as
(select d.day, w.workflow_id, max(w.version) as max_version
from snapshot_dates d, workflow w
where d.day >= w.day
group by d.day, w.workflow_id
order by d.day, w.workflow_id)

select s.day, w.status, count(1)
from workflow w join snapshot s on w.workflow_id=s.workflow_id and w.version = s.max_version
group by s.day, w.status
order by s.day, w.status;

This is the expected output of the query:
day,status,count  
20151101,In Progress,2
20151102,Completed,1
20151102,In Progress,3
20151103,Completed,2
20151103,In Progress,2
20151104,Completed,3
20151104,In Progress,1
20151105,Completed,4
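
For readers who want to check the expected output without a database, the same snapshot logic can be sketched in plain Python. This is only an illustration of what the SQL computes, not an Elasticsearch query; the function name `daily_status_counts` and the tuple layout are assumptions made for this example.

```python
from collections import Counter

# Sample rows from the question: (day, workflow_id, version, status).
ROWS = [
    (20151101, 1, 1, "In Progress"),
    (20151101, 2, 1, "In Progress"),
    (20151102, 1, 2, "In Progress"),
    (20151102, 3, 1, "In Progress"),
    (20151102, 4, 1, "In Progress"),
    (20151102, 2, 2, "Completed"),
    (20151103, 1, 3, "Completed"),
    (20151103, 3, 2, "In Progress"),
    (20151104, 3, 3, "Completed"),
    (20151105, 4, 2, "Completed"),
]


def daily_status_counts(rows):
    """For each distinct day, count the statuses of the latest version
    of every workflow recorded on or before that day."""
    result = {}
    for day in sorted({r[0] for r in rows}):
        latest = {}  # workflow_id -> (version, status)
        for d, workflow_id, version, status in rows:
            if d > day:
                continue  # record was created after this snapshot day
            if workflow_id not in latest or version > latest[workflow_id][0]:
                latest[workflow_id] = (version, status)
        result[day] = Counter(status for _, status in latest.values())
    return result


# Print in the same day,status,count layout as the SQL result.
for day, counts in daily_status_counts(ROWS).items():
    for status in sorted(counts):
        print(f"{day},{status},{counts[status]}")
```

Like the SQL, this rescans all records for every day, so it is meant for checking small samples rather than for production use.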

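I am still new to Elasticsearch and wonder whether it can run a similar query, given properly defined mappings and queries, without any application-side logic. More generally, what is the best practice for solving this kind of problem with Elasticsearch?

Best answer

I tried to find a solution using the bucket selector aggregation, but got stuck. I discussed the same problem on the Elasticsearch forum. Here is Christian Dahlqvist's suggestion: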

In addition to this you also index the record into a workflow-centric index with a unique identifier, e.g. workflow id, as the document id. If several updates come in for the same workflow, each will result in an update and the latest state will be preserved. Running aggregations across this index to find the current or latest state will be considerably more efficient and scalable as you only have a single record per workflow and do not need to filter out documents based on relationships to other documents.



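So, following this suggestion, use the workflow ID as the document ID when indexing. Whenever a workflow is updated, update its document, addressed by workflow ID, with the new version and date (and the status, if it changed). Suppose the index is named workflow and the index type is workflow_status. The mapping for this workflow_status type is then: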
{
  "workflow_status": {
    "properties": {
      "date": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis"
      },
      "status": {
        "type": "string",
        "index": "not_analyzed"
      },
      "version": {
        "type": "long"
      },
      "workFlowId": {
        "type": "long"
      }
    }
  }
}

Keep adding or updating documents of this index type, using workFlowId as the document ID.
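
To make the "one document per workflow" idea concrete, here is a minimal sketch that builds the index request for each update and replays the sample records up to 2015-11-02 against an in-memory stand-in for the index. The helper name `index_request` and the `store` dictionary are assumptions made for illustration; the path follows the `/{index}/{type}/{id}` layout used by the search URL in this answer, and no request is actually sent.

```python
import json
from collections import Counter


def index_request(record, index="workflow", doc_type="workflow_status"):
    """Build (method, path, body) for indexing one record,
    using its workFlowId as the document ID."""
    path = f"/{index}/{doc_type}/{record['workFlowId']}"
    return "PUT", path, json.dumps(record)


# Sample updates through 2015-11-02, in arrival order.
updates = [
    {"date": "2015-11-01", "workFlowId": 1, "version": 1, "status": "In Progress"},
    {"date": "2015-11-01", "workFlowId": 2, "version": 1, "status": "In Progress"},
    {"date": "2015-11-02", "workFlowId": 1, "version": 2, "status": "In Progress"},
    {"date": "2015-11-02", "workFlowId": 3, "version": 1, "status": "In Progress"},
    {"date": "2015-11-02", "workFlowId": 4, "version": 1, "status": "In Progress"},
    {"date": "2015-11-02", "workFlowId": 2, "version": 2, "status": "Completed"},
]

# In-memory stand-in for the index: document path -> latest document.
store = {}
for record in updates:
    method, path, body = index_request(record)
    # The same path means the same document ID, so the earlier
    # document for that workflow is overwritten.
    store[path] = json.loads(body)

status_counts = Counter(doc["status"] for doc in store.values())
print(len(store), dict(status_counts))
```

Because each workflow maps to exactly one document, counting statuses no longer requires picking the maximum version per workflow. Note that this sketch assumes updates arrive in version order; an older update arriving late would overwrite a newer one.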

Now, to display the daily chart, you may need to create another index type, say per_day_workflow, with the following mapping:
{
  "per_day_workflow": {
    "properties": {
      "date": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis"
      },
      "in_progress": {
        "type": "long"
      },
      "completed": {
        "type": "long"
      }
    }
  }
}

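This index type holds one document per day. You therefore need a job that runs at the end of each day and uses the following aggregation search to get the total number of "In Progress" and "Completed" workflows from the workflow_status index type: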
POST http://localhost:9200/workflow/workflow_status/_search?search_type=count

{
  "aggs": {
    "per_status": {
      "terms": {
        "field": "status"
      }
    }
  }
}

The response looks like this (I ran it against your sample data for the date 2015-11-02):
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "per_status": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "In Progress",
          "doc_count": 3
        },
        {
          "key": "Completed",
          "doc_count": 1
        }
      ]
    }
  }
}

From this response, extract the In Progress and Completed counts and add them to the per_day_workflow index type with today's date.
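
The extraction step of the end-of-day job could look roughly like the sketch below. It only turns a parsed aggregation response into a document shaped like the per_day_workflow mapping; the function name `per_day_document` is an assumption for this example, and sending the search and indexing the result are left out.

```python
def per_day_document(response, date):
    """Turn the per_status terms aggregation of a search response into
    a document with date, in_progress and completed fields."""
    buckets = response["aggregations"]["per_status"]["buckets"]
    counts = {bucket["key"]: bucket["doc_count"] for bucket in buckets}
    return {
        "date": date,
        # A status with no documents has no bucket, so default to 0.
        "in_progress": counts.get("In Progress", 0),
        "completed": counts.get("Completed", 0),
    }


# The relevant part of the sample response for 2015-11-02.
sample_response = {
    "aggregations": {
        "per_status": {
            "buckets": [
                {"key": "In Progress", "doc_count": 3},
                {"key": "Completed", "doc_count": 1},
            ]
        }
    }
}

doc = per_day_document(sample_response, "2015-11-02")
print(doc)
```

Defaulting a missing status to 0 matters on days such as 2015-11-01 or 2015-11-05 in the sample data, where only one of the two statuses occurs.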

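Now, whenever you need the per-day chart data, you can easily fetch it from the per_day_workflow index type.

Regarding "elasticsearch - Nested queries in Elasticsearch?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33858755/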
