gpt4 book ai didi

Mongodb.aggregate() 忽略索引

转载 作者:行者123 更新时间:2023-12-03 15:59:57 25 4
gpt4 key购买 nike

我收集了一些具有大致以下结构的存档任务

{
"_id" : "job-id_00000001_2017-03-17T21:30:38.510Z",
"jobId" : "job-id",
"result" : {
"status" : "ok"
},
"..." : "..."
}

最重要的是,我有索引

jobId: 1
result.status: 1
jobId: 1, result.status: 1

在某些用例中,我需要相当频繁地更新统计信息(映射:job-id -> status -> count),并且当我执行此聚合函数时......

db.getCollection('jobs_archive').aggregate([
{$group: {
_id: {jobId: "$jobId", status: "$result.status"},
count: { $sum: 1 }
}}
], {explain: true} )

...它在 120 万行上运行约 4 秒,这是 Not Acceptable 长。通过 explain: true 我得到...

"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "db.jobs_archive",
"indexFilterSet" : false,
"parsedQuery" : {},
"winningPlan" : {
"stage" : "COLLSCAN",
"direction" : "forward"
},
"rejectedPlans" : []
}

...和 ​​COLLSCAN 意味着 Mongo 不使用索引中的数据,但所有字段都在复合索引 jobId: 1, result.status: 1 中可用。

有没有办法优化聚合查询的性能?我做错了什么吗?

<小时/>

(由 Ori Dar 的回答触发的附录)

在深入研究文档后,我注意到“涵盖的查询”,在这种情况下应该使用我认为应该使用的功能。看来并非如此。

涵盖查询 https://docs.mongodb.com/manual/core/query-optimization/#covered-query

A covered query is a query that can be satisfied entirely using an index and does not have to examine any documents. An index covers a query when both of the following apply:

  • all the fields in the query are part of an index, and
  • all the fields returned in the results are in the same index.

...

Because the index contains all fields required by the query, MongoDB can both match the query conditions and return the results using only the index.

Querying only the index can be much faster than querying documents outside of the index. Index keys are typically smaller than the documents they catalog, and indexes are typically available in RAM or located sequentially on disk.

<小时/>

Mongo 的更多奇怪之处

(1) db.getCollection('jobs_archive').find({"jobId" : "job-id"}).count()
--> 0.375sec, count = 430000

(2) db.getCollection('archive').find({"jobId" : "job-id", "result.status": "ok"}).count()
--> 1.400sec, count = 430000

explain()

  1. 获胜计划:IXSCAN/“indexName”:“jobId_1_result.status_1”
  2. 获胜计划:IXSCAN/“indexName”:“jobId_1”

所以,如果 Mongo 正确使用索引,我会为“job-id+status”的每个组合使用“query().count()”(它是 6 * 5),但似乎它不在此列情况也是如此。当我指定两个键“jobId + result.status”复合索引不用于 count() ...并且当我在查询中仅指定一个 jobId 时,复合索引IS 使用... r-r-r-r

注意:Mongo“版本”:“3.4.2”,Ubuntu 16

最佳答案

来自Pipeline Operators and Indexes

Pipeline Operators and Indexes¶

The $match and $sort pipeline operators can take advantage of an index when they occur at the beginning of the pipeline.

MongoDB 不会使用 $group 的索引

您正在进行全面扫描,即所有文档均已处理。因此,使用索引会导致对每个文档进行重复查找:一次针对索引,一次针对文档本身,所以有什么意义。

因此,只有在使用 $match 缩小结果范围时才能使用索引。首先过滤。

作为旁注,{jobId: 1}索引是多余的。

查询优化器可以使用{jobId: 1, result.status: 1}使用以下模式的查询索引:db.jobs_archive.find({jobId: n})

参见Prefixes

关于Mongodb.aggregate() 忽略索引,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/42926458/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com