Mongodb.aggregate() ignores indexes

Reposted by: 行者123 | Updated: 2023-12-03 15:59:57

I have a collection of archived jobs with roughly the following structure:

{
    "_id" : "job-id_00000001_2017-03-17T21:30:38.510Z",
    "jobId" : "job-id",
    "result" : {
        "status" : "ok"
    },
    "..." : "..."
}

On top of that, I have these indexes:

jobId: 1
result.status: 1
jobId: 1, result.status: 1

For some use cases I need to refresh statistics (a mapping job-id -> status -> count) fairly frequently, and when I run this aggregation...

db.getCollection('jobs_archive').aggregate([
            {$group: {
                _id: {jobId: "$jobId", status: "$result.status"},
                count: { $sum: 1 }
            }}
        ], {explain: true} )

...it runs for about 4 seconds on 1.2 million documents, which is unacceptably long. With explain: true I get...

"queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "db.jobs_archive",
    "indexFilterSet" : false,
    "parsedQuery" : {},
    "winningPlan" : {
        "stage" : "COLLSCAN",
        "direction" : "forward"
    },
    "rejectedPlans" : []
}

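...and COLLSCAN means that Mongo is not using the data in the index, even though all the required fields are available in the compound index jobId: 1, result.status: 1.

Is there a way to optimize the performance of this aggregation? Am I doing something wrong?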
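
For reference, the job-id -> status -> count mapping mentioned above can be derived from the $group output with a small fold. The following is only a sketch: the function name toStatusCounts and the sample rows (including the "failed" status) are made up for illustration and do not come from the original post.

```javascript
// Fold $group output rows into a nested map: jobId -> status -> count.
// The row shape mirrors the _id / count fields of the aggregation in the question.
function toStatusCounts(rows) {
  const stats = {};
  for (const row of rows) {
    const { jobId, status } = row._id;
    if (!stats[jobId]) stats[jobId] = {};
    stats[jobId][status] = row.count;
  }
  return stats;
}

// Hand-written sample rows standing in for real aggregation output (hypothetical data).
const sampleRows = [
  { _id: { jobId: "job-id", status: "ok" }, count: 3 },
  { _id: { jobId: "job-id", status: "failed" }, count: 1 },
  { _id: { jobId: "other-job", status: "ok" }, count: 2 },
];
const statusCounts = toStatusCounts(sampleRows);
```

In the mongo shell the rows would presumably come from calling .toArray() on the aggregation cursor (without the explain option).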

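(Addendum prompted by Ori Dar's answer)

After digging through the documentation I came across "covered queries", which looked like exactly the feature that should apply in this case. Apparently it does not.

Covered query: https://docs.mongodb.com/manual/core/query-optimization/#covered-query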

A covered query is a query that can be satisfied entirely using an index and does not have to examine any documents. An index covers a query when both of the following apply:

all the fields in the query are part of an index, and

all the fields returned in the results are in the same index.

...

Because the index contains all fields required by the query, MongoDB can both match the query conditions and return the results using only the index.

Querying only the index can be much faster than querying documents outside of the index. Index keys are typically smaller than the documents they catalog, and indexes are typically available in RAM or located sequentially on disk.
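
The two quoted conditions can be written down as a small check. This is a simplified sketch, not how the server decides: the function name isCoveredBy is hypothetical, it only models find()-style inclusion projections against one index, and the server applies further restrictions that it ignores.

```javascript
// Returns true when both quoted conditions hold for a single index.
// indexKeys:    field names in the index, e.g. ["jobId", "result.status"]
// filterFields: field names used in the query filter
// projection:   find()-style inclusion projection, e.g. { _id: 0, jobId: 1 }
function isCoveredBy(indexKeys, filterFields, projection) {
  const inIndex = (field) => indexKeys.includes(field);
  // Fields explicitly requested with 1.
  const returned = Object.keys(projection).filter((f) => projection[f] === 1);
  // Without an inclusion projection whole documents are returned, so no index can cover it.
  if (returned.length === 0) return false;
  // _id is returned by default unless it is excluded with _id: 0.
  if (projection._id !== 0 && !returned.includes("_id")) returned.push("_id");
  return filterFields.every(inIndex) && returned.every(inIndex);
}

const compoundIndex = ["jobId", "result.status"];
const coveredExample = isCoveredBy(
  compoundIndex,
  ["jobId"],
  { _id: 0, jobId: 1, "result.status": 1 }
);
const notCoveredExample = isCoveredBy(compoundIndex, ["jobId"], { jobId: 1 });
```

The second call fails the check only because _id is returned by default and is not part of the compound index.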

More Mongo oddities

(1) db.getCollection('jobs_archive').find({"jobId" : "job-id"}).count()
--> 0.375sec, count = 430000

(2) db.getCollection('archive').find({"jobId" : "job-id", "result.status": "ok"}).count()
--> 1.400sec, count = 430000

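explain() says

(1) winningPlan: IXSCAN / "indexName": "jobId_1_result.status_1"
(2) winningPlan: IXSCAN / "indexName": "jobId_1"

So if Mongo used the indexes properly, I would run "query().count()" for every "job-id + status" combination (there are 6 * 5 of them), but that does not seem to work either. When I specify both keys, "jobId + result.status", the compound index is not used for count() ... and when I specify only jobId in the query, the compound index IS used ... r-r-r-r

Note: Mongo "version": "3.4.2", Ubuntu 16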

Best answer

From Pipeline Operators and Indexes:

Pipeline Operators and Indexes

The $match and $sort pipeline operators can take advantage of an index when they occur at the beginning of the pipeline.

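MongoDB won't use an index for $group.

You are doing a full scan, i.e. every document is processed. Using an index would therefore mean a double lookup for each document: once in the index and once for the document itself, so what would be the point?

An index can therefore only be used when $match narrows down the result set. Filter first.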
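
A minimal sketch of "filter first", assuming the statistics can be refreshed for one job at a time. The helper name statusCountsPipeline is hypothetical; the $group stage is the one from the question.

```javascript
// Build a pipeline that filters by jobId before grouping, so the leading
// $match is the stage that can take advantage of an index.
function statusCountsPipeline(jobId) {
  return [
    { $match: { jobId: jobId } },
    {
      $group: {
        _id: { jobId: "$jobId", status: "$result.status" },
        count: { $sum: 1 },
      },
    },
  ];
}

const pipeline = statusCountsPipeline("job-id");
// In the mongo shell this would be run as:
// db.getCollection('jobs_archive').aggregate(pipeline, { explain: true })
```

Whether this helps depends on how selective the filter is. In the question a single jobId matches 430000 of roughly 1.2 million documents, so the gain may be modest; checking the explain output is the way to find out.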

As a side note, the {jobId: 1} index is redundant.

The query optimizer can use the {jobId: 1, result.status: 1} index for queries of the following form: db.jobs_archive.find({jobId: n})

See Prefixes
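
The prefix rule behind that side note can be sketched as a comparison of key lists. The function name isIndexPrefix is made up for illustration, and the check compares field names only; it assumes all keys are ascending, as they are in the question, and ignores sort direction.

```javascript
// True when candidateKeys is a leading subset of indexKeys, in the same order.
function isIndexPrefix(candidateKeys, indexKeys) {
  return (
    candidateKeys.length > 0 &&
    candidateKeys.length <= indexKeys.length &&
    candidateKeys.every((key, i) => key === indexKeys[i])
  );
}

const compoundKeys = ["jobId", "result.status"];
// {jobId: 1} is a prefix of the compound index, so queries on jobId alone can use it.
const jobIdIsPrefix = isIndexPrefix(["jobId"], compoundKeys);
// {result.status: 1} is not a prefix, so that separate index is not made redundant.
const statusIsPrefix = isIndexPrefix(["result.status"], compoundKeys);
```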

The original question about Mongodb.aggregate() ignoring indexes is on Stack Overflow: https://stackoverflow.com/questions/42926458/
