mongodb - Mongo: count the number of word occurrences in a set of documents

Reposted by: IT老高. Updated: 2023-10-28 13:09:49

I have a set of documents in Mongo. Say:

[
    { summary:"This is good" },
    { summary:"This is bad" },
    { summary:"Something that is neither good nor bad" }
]

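I want to count the occurrences of each word (case-insensitively) and then sort the counts in descending order. The result should look something like this: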

[
    "is": 3,
    "bad": 2,
    "good": 2,
    "this": 2,
    "neither": 1,
    "nor": 1,
    "something": 1,
    "that": 1
]

Any idea how to do this? The aggregation framework would be preferred, since I already understand it to some degree :)

Best answer

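MapReduce may be a good fit here, because it can process the documents on the server without any manipulation on the client (there is no feature for splitting a string on the DB server; see the open issue).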

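Start with the map function. In the example below (which probably needs to be more robust), each document is passed to the map function (as this). The code looks for the summary field and, if it is present, lowercases it, splits it on spaces, and then emits a 1 for each word found.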

var map = function() {  
    var summary = this.summary;
    if (summary) { 
        // quick lowercase to normalize per your requirements
        summary = summary.toLowerCase().split(" "); 
        for (var i = summary.length - 1; i >= 0; i--) {
            // might want to remove punctuation, etc. here
            if (summary[i])  {      // make sure there's something
               emit(summary[i], 1); // store a 1 for each word
            }
        }
    }
};
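The comment in the map function notes that punctuation may need to be removed. A minimal sketch of one way to do that follows. The helper name tokenize is hypothetical and not part of the original answer, and it assumes ASCII text where letters, digits, and apostrophes are the only word characters. It runs in plain JavaScript, so it can be tried outside MongoDB.

```javascript
// Hypothetical helper (not from the original answer): turn a summary string
// into a list of lowercase words, dropping punctuation and empty tokens.
// Assumption: ASCII text; letters, digits, and apostrophes form words.
function tokenize(summary) {
    if (typeof summary !== "string") {
        return [];                      // missing or non-string field: no words
    }
    return summary
        .toLowerCase()
        .split(/[^a-z0-9']+/)           // split on runs of non-word characters
        .filter(function (word) {       // drop empty strings left at the edges
            return word.length > 0;
        });
}

console.log(tokenize("This is good, isn't it?"));
```

To use this inside map, the body would most likely have to be inlined, since the map function runs on the server and does not see helpers defined in the shell.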

Then, in the reduce function, sum all of the values emitted by the map function and return a single total for each word that was emitted above.

var reduce = function(key, values) {
    // key is a word; values is the list of counts emitted for that word
    var count = 0;
    values.forEach(function(v) {
        count += v;
    });
    return count;
};
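The map and reduce logic can be checked locally before running it against a server. The sketch below imitates the two steps in memory with plain JavaScript. The function name wordCount and the stubbed emit are assumptions made for illustration; on a real server emit is supplied by MongoDB, and reduce may be skipped for keys that have only one value.

```javascript
// In-memory imitation of the map and reduce steps (illustration only).
// wordCount and the local emit stub are hypothetical, not MongoDB APIs.
function wordCount(docs) {
    var emitted = new Map();            // word -> array of emitted values

    function emit(key, value) {         // stand-in for the server's emit
        if (!emitted.has(key)) {
            emitted.set(key, []);
        }
        emitted.get(key).push(value);
    }

    function map() {                    // same logic as the map function above
        var summary = this.summary;
        if (summary) {
            var words = summary.toLowerCase().split(" ");
            for (var i = 0; i < words.length; i++) {
                if (words[i]) {
                    emit(words[i], 1);
                }
            }
        }
    }

    function reduce(key, values) {      // same logic as the reduce function
        var count = 0;
        values.forEach(function (v) { count += v; });
        return count;
    }

    docs.forEach(function (doc) { map.call(doc); });

    var results = [];
    emitted.forEach(function (values, key) {
        results.push({ _id: key, value: reduce(key, values) });
    });
    // Highest count first; ties broken alphabetically for a stable order.
    results.sort(function (a, b) {
        return b.value - a.value || (a._id < b._id ? -1 : a._id > b._id ? 1 : 0);
    });
    return results;
}

var docs = [
    { summary: "This is good" },
    { summary: "This is bad" },
    { summary: "Something that is neither good nor bad" }
];
var results = wordCount(docs);
console.log(results);
```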

Finally, execute the mapReduce:

> db.so.mapReduce(map, reduce, {out: "word_count"})

The results for your sample data:

> db.word_count.find().sort({value:-1})
{ "_id" : "is", "value" : 3 }
{ "_id" : "bad", "value" : 2 }
{ "_id" : "good", "value" : 2 }
{ "_id" : "this", "value" : 2 }
{ "_id" : "neither", "value" : 1 }
{ "_id" : "nor", "value" : 1 }
{ "_id" : "something", "value" : 1 }
{ "_id" : "that", "value" : 1 }
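The question asked for the aggregation framework, and the answer above was written when the server had no string-splitting operator. Assuming MongoDB 3.4 or later, where the $split operator is available, a pipeline along the following lines is one possible sketch. The variable name pipeline is my own, and the collection name so is taken from the mapReduce call above. Like the map function, it splits on single spaces only and does not strip punctuation.

```javascript
// Sketch of an aggregation pipeline, assuming MongoDB 3.4+ ($split support).
var pipeline = [
    // Keep only documents whose summary is a string.
    { $match: { summary: { $type: "string" } } },
    // Lowercase the summary and split it into an array of words.
    { $project: { words: { $split: [{ $toLower: "$summary" }, " "] } } },
    // Produce one document per word.
    { $unwind: "$words" },
    // Drop empty strings caused by repeated spaces.
    { $match: { words: { $ne: "" } } },
    // Count occurrences of each word.
    { $group: { _id: "$words", value: { $sum: 1 } } },
    // Highest count first; ties in alphabetical order.
    { $sort: { value: -1, _id: 1 } }
];

// In the mongo shell this would be run as:
//   db.so.aggregate(pipeline)
```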

Regarding "mongodb - Mongo: count the number of word occurrences in a set of documents", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/16174591/
