python mongodb $match 和 $group-6ren

python mongodb $match 和 $group

转载作者：行者123 更新时间：2023-11-30 23:12:07

我想编写一个简单的查询，为我提供拥有最多关注者、时区为巴西且已发推文 100 次或以上的用户:

这是我的台词:

pipeline = [{'$match':{"user.statuses_count":{"$gt":99},"user.time_zone":"Brasilia"}},
            {"$group":{"_id": "$user.followers_count","count" :{"$sum":1}}},
            {"$sort":{"count":-1}} ]

我根据练习题改编了它。

This was given as an example for the structure :
    {
    "_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
    "text" : "First week of school is over :P",
    "in_reply_to_status_id" : null,
    "retweet_count" : null,
    "contributors" : null,
    "created_at" : "Thu Sep 02 18:11:25 +0000 2010",
    "geo" : null,
    "source" : "web",
    "coordinates" : null,
    "in_reply_to_screen_name" : null,
    "truncated" : false,
    "entities" : {
        "user_mentions" : [ ],
        "urls" : [ ],
        "hashtags" : [ ]
    },
    "retweeted" : false,
    "place" : null,
    "user" : {
        "friends_count" : 145,
        "profile_sidebar_fill_color" : "E5507E",
        "location" : "Ireland :)",
        "verified" : false,
        "follow_request_sent" : null,
        "favourites_count" : 1,
        "profile_sidebar_border_color" : "CC3366",
        "profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkHoxzmAM_normal.jpg",
        "geo_enabled" : false,
        "created_at" : "Sun May 03 19:51:04 +0000 2009",
        "description" : "",
        "time_zone" : null,
        "url" : null,
        "screen_name" : "Catherinemull",
        "notifications" : null,
        "profile_background_color" : "FF6699",
        "listed_count" : 77,
        "lang" : "en",
        "profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg",
        "statuses_count" : 2475,
        "following" : null,
        "profile_text_color" : "362720",
        "protected" : false,
        "show_all_inline_media" : false,
        "profile_background_tile" : true,
        "name" : "Catherine Mullane",
        "contributors_enabled" : false,
        "profile_link_color" : "B40B43",
        "followers_count" : 169,
        "id" : 37486277,
        "profile_use_background_image" : true,
        "utc_offset" : null
    },
    "favorited" : false,
    "in_reply_to_user_id" : null,
    "id" : NumberLong("22819398300")
}

有人能发现我的错误吗？

最佳答案

假设您有几个包含最小测试用例的示例文档。将测试文档插入到mongoshell中的集合中:

db.collection.insert([
{
    "_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
    "user" : {
        "friends_count" : 145,
        "statuses_count" : 457,
        "screen_name" : "Catherinemull",
        "time_zone" : "Brasilia",
        "followers_count" : 169,
        "id" : 37486277
    },
    "id" : NumberLong(22819398300)
},
{
    "_id" : ObjectId("52fd2490bac3fa1975477702"),
    "user" : {
        "friends_count" : 145,
        "statuses_count" : 12334,
        "time_zone" : "Brasilia",
        "screen_name" : "marble",
        "followers_count" : 2597,
        "id" : 37486278
    },
    "id" : NumberLong(22819398301)
}])

为了让您获得时区 “巴西利亚” 内拥有最多关注者且发推文 100 次或以上的用户，此管道实现了预期结果，但不使用$group运算符:

pipeline = [
    {
        "$match": {
            "user.statuses_count": {
                "$gt":99 
            }, 
            "user.time_zone": "Brasilia"
        }
    },
    {
        "$project": {                
            "followers": "$user.followers_count",
            "screen_name": "$user.screen_name",
            "tweets": "$user.statuses_count"
        }
    },
    {
        "$sort": { 
            "followers": -1 
        }
    },
    {"$limit" : 1}
]

Pymongo 输出:

{u'ok': 1.0,
 u'result': [{u'_id': ObjectId('52fd2490bac3fa1975477702'),
              u'followers': 2597,
              u'screen_name': u'marble',
              u'tweets': 12334}]}

以下聚合管道也将为您提供所需的结果。在管道中，第一阶段是 $match运算符，用于过滤用户已获取 timezone 字段值 “Brasilia” 且推文计数(由 statuses_count 表示)更大的文档大于或等于 100 通过 $gte 匹配比较运算符。

第二个管道阶段有 $group运算符，它按指定的标识符表达式(即 $user.id 字段)对过滤后的文档进行分组，并应用累加器表达式 $max到 $user.followers_count 字段中的每个组，以获得每个用户的最大关注者数量。系统变量$$ROOT它引用根文档，即当前正在 $group 中处理的顶级文档。聚合管道阶段，被添加到额外的数组字段中以供稍后使用。这是通过使用 $addToSet 来实现的数组运算符。

下一个管道阶段 $unwinds为 data 数组中的每个元素输出一个文档，以便在下一步中进行处理。

以下管道步骤，$project ，然后通过添加具有前一个流中的值的新字段来转换流中的每个文档。

最后两个管道阶段 $sort和 $limit按指定的排序键followers 对文档流重新排序，并返回包含拥有最多关注者数量的用户的文档。

最终的聚合管道应如下所示:

db.collection.aggregate([
    {
        '$match': { 
            "user.statuses_count": { "$gte": 100 },
            "user.time_zone": "Brasilia"
        }
    },
    {
        "$group": {
            "_id": "$user.id",
            "max_followers": { "$max": "$user.followers_count" },
            "data": { "$addToSet": "$$ROOT" }
        }
    },
    {
        "$unwind": "$data"
    },   
    {
        "$project": {
            "_id": "$data._id",
            "followers": "$max_followers",
            "screen_name": "$data.user.screen_name",
            "tweets": "$data.user.statuses_count"
        }
    }, 
    {
        "$sort": { "followers": -1 }
    },
    {
        "$limit" : 1
    }
])

在 Robomongo 中执行此操作会得到结果

/* 0 */
{
    "result" : [ 
        {
            "_id" : ObjectId("52fd2490bac3fa1975477702"),
            "followers" : 2597,
            "screen_name" : "marble",
            "tweets" : 12334
        }
    ],
    "ok" : 1
}

在 python 中，实现本质上应该是相同的:

>>> pipeline = [
...     {"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "Brasilia"}},
...     {"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addToSet": "$$ROO
T" }}},
...     {"$unwind": "$data"},
...     {"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets":
 "$data.user.statuses_count"}},
...     {"$sort": { "followers": -1 }},
...     {"$limit" : 1}
... ]
>>>
>>> for doc in collection.aggregate(pipeline):
...     print(doc)
...
{u'tweets': 12334.0, u'_id': ObjectId('52fd2490bac3fa1975477702'), u'followers': 2597.0, u'screen_name': u'marble'}
>>>

哪里

pipeline = [
    {"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "Brasilia"}},
    {"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addToSet": "$$ROOT" }}},
    {"$unwind": "$data"},   
    {"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets": "$data.user.statuses_count"}}, 
    {"$sort": { "followers": -1 }},
    {"$limit" : 1}
]

关于python mongodb $match 和 $group，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/29942368/

文章推荐： Mysql date_sub 间隔12个月

文章推荐： c# - XElement 从 XDocument 解析，远在下面，重复 3

文章推荐： mysql - 精确词搜索中的单词灵活性

文章推荐： c# - JSON反序列化不调用属性方法

python - Python 中的集群或合并集群以减少组数 (Python)
我正在处理一组标记为 160 个组的 173k 点。我想通过合并最接近的(到 9 或 10 个组)来减少组/集群的数量。我搜索过 sklearn 或类似的库，但没有成功。我猜它只是通过 knn 聚类
python - python 列表的子集基于同一列表的元素组，pythonically
我有一个扁平数字列表，这些数字逻辑上以 3 为一组，其中每个三元组是 (number, __ignored, flag[0 or 1])，例如: [7,56,1, 8,0,0, 2,0,0, 6,1,
python - 激活 Python 虚拟环境并在另一个 Python 脚本中调用 Python 脚本
我正在使用 pipenv 来管理我的包。我想编写一个 python 脚本来调用另一个使用不同虚拟环境(VE)的 python 脚本。如何运行使用 VE1 的 python 脚本 1 并调用另一个 p
python - 在焕然一新的 Python 环境中以编程方式从 Python 内部执行 Python 文件
假设我有一个文件 script.py 位于 path = "foo/bar/script.py"。我正在寻找一种在 Python 中通过函数 execute_script() 从我的主要 Python
python - 从 python 脚本但在 python 脚本之外运行 python 脚本
这听起来像是谜语或笑话，但实际上我还没有找到这个问题的答案。问题到底是什么？我想运行 2 个脚本。在第一个脚本中，我调用另一个脚本，但我希望它们继续并行，而不是在两个单独的线程中。主要是我不希望第
python - 使用不同的 python 从 python 运行 python 脚本
我有一个带有 python 2.5.5 的软件。我想发送一个命令，该命令将在 python 2.7.5 中启动一个脚本，然后继续执行该脚本。我试过用 #!python2.7.5 和http://re
python - 为什么从 Python 命令行调用 Python 时 Python 无法找到并运行我的脚本？
我在 python 命令行(使用 python 2.7)中，并尝试运行 Python 脚本。我的操作系统是 Windows 7。我已将我的目录设置为包含我所有脚本的文件夹，使用: os.chdir("
python - 使用动态版本的 Python 执行嵌入的 Python 代码时出现致命的 Python 错误
剧透:部分解决(见最后)。以下是使用 Python 嵌入的代码示例: #include int main(int argc, char** argv) { Py_SetPythonHome
python - python 中识别 python 数组或列表中最大累积差异的最快方法是什么？
假设我有以下列表，对应于及时的股票价格: prices = [1, 3, 7, 10, 9, 8, 5, 3, 6, 8, 12, 9, 6, 10, 13, 8, 4, 11] 我想确定以下总体上最
python - (Python) 通过单选按钮 python 更新背景
所以我试图在选择某个单选按钮时更改此框架的背景。我的框架位于一个类中，并且单选按钮的功能位于该类之外。 (这样我就可以在所有其他框架上调用它们。) 问题是每当我选择单选按钮时都会出现以下错误: co
python - python 中的字符串与正则表达式比较在 python 中失败
我正在尝试将字符串与 python 中的正则表达式进行比较，如下所示， #!/usr/bin/env python3 import re str1 = "Expecting property name
python - python 如何加载Boost.Python 库？
考虑以下原型(prototype) Boost.Python 模块，该模块从单独的 C++ 头文件中引入类“D”。 /* file: a/b.cpp */ BOOST_PYTHON_MODULE(c)
python - python 检查模块 python 的问题
如何编写一个程序来“识别函数调用的行号？” python 检查模块提供了定位行号的选项，但是， def di(): return inspect.currentframe().f_back.f_l
python - 系统 python 与用户 python
我已经使用 macports 安装了 Python 2.7，并且由于我的 $PATH 变量，这就是我输入 $ python 时得到的变量。然而，virtualenv 默认使用 Python 2.6，除
python - [Python] : Python re. 长字符串行的搜索速度优化
我只想问如何加快 python 上的 re.search 速度。我有一个很长的字符串行，长度为 176861(即带有一些符号的字母数字字符)，我使用此函数测试了该行以进行研究: def getExe
python - 编辑字符串 python 正则表达式 python
list1= [u'%app%%General%%Council%', u'%people%', u'%people%%Regional%%Council%%Mandate%', u'%ppp%%Ge
python - Python 映射中的副作用(Python "do" block )
这个问题在这里已经有了答案: Is it Pythonic to use list comprehensions for just side effects? (7 个答案) 关闭 4 个月前。告
python - 使用其值逻辑组合两个 python 列表 - Python
我想用 Python 将两个列表组合成一个列表，方法如下: a = [1,1,1,2,2,2,3,3,3,3] b= ["Sun", "is", "bright", "June","and" ,"Ju
python - Boost.Python python 链接错误
我正在运行带有最新 Boost 发行版 (1.55.0) 的 Mac OS X 10.8.4 (Darwin 12.4.0)。我正在按照说明 here构建包含在我的发行版中的教程 Boost-Pyth
python - 在 Python 中仅使用内置库制作一个基本的网络抓取工具 - Python
学习 Python，我正在尝试制作一个没有任何第 3 方库的网络抓取工具，这样过程对我来说并没有简化，而且我知道我在做什么。我浏览了一些在线资源，但所有这些都让我对某些事情感到困惑。 html 看起来

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

python mongodb $match 和 $group