python - Difference between Python's collections.Counter and nltk.probability.FreqDist

Reposted by: 太空狗. Updated: 2023-10-30 00:49:01

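I want to count the term frequencies of words in a text corpus. For some time I have been doing this with NLTK's word_tokenize followed by probability.FreqDist. word_tokenize returns a list, which FreqDist converts into a frequency distribution. However, I recently came across the Counter class in collections (collections.Counter), which seems to do exactly the same thing. Both FreqDist and Counter have a most_common(n) method that returns the n most common words. Does anyone know whether there is a difference between the two? Is one faster than the other? Are there cases where one works and the other does not?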

Best answer

nltk.probability.FreqDist is a subclass of collections.Counter.

From the docs:

A frequency distribution for the outcomes of an experiment. A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the frequency of each word type in a document. Formally, a frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.

The inheritance is shown explicitly in the code. In essence, there is no difference in how a Counter and a FreqDist are initialized; see https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L106

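So in terms of speed, creating a Counter and creating a FreqDist should be the same. Any difference in speed should be insignificant, but it is worth noting that the overhead could come from:

the compilation of the class when it is defined in the interpreter
the cost of duck-typing .__init__()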
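
Rather than taking the speed claim on faith, one can measure construction time directly. The following is a rough sketch, not a rigorous benchmark: the token list is synthetic (an assumption; in practice it would be the output of word_tokenize), build_time is a hypothetical helper name, and when NLTK is not installed the code falls back to an empty stand-in subclass that only mimics the inheritance, not FreqDist itself.

```python
import timeit
from collections import Counter

try:
    from nltk import FreqDist      # the real class, when NLTK is installed
except ImportError:
    class FreqDist(Counter):       # hypothetical stand-in so the sketch still runs
        pass

# Synthetic token list (an assumption); substitute the output of word_tokenize.
tokens = ("the quick brown fox jumps over the lazy dog " * 500).split()

def build_time(cls, number=10, repeat=3):
    """Best-of-`repeat` time in seconds to build `cls(tokens)` `number` times."""
    return min(timeit.repeat(lambda: cls(tokens), number=number, repeat=repeat))

for cls in (Counter, FreqDist):
    print(f"{cls.__name__:>8}: {build_time(cls):.4f} s")
```

The numbers depend on the Python and NLTK versions and on the size of the corpus, so they are best read as a comparison on one machine rather than as absolute figures.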

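The major difference is the set of functions that FreqDist provides for statistical / probabilistic natural language processing (NLP), e.g. finding hapaxes. The full list of functions with which FreqDist extends Counter is as follows: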

>>> from collections import Counter
>>> from nltk import FreqDist
>>> x = FreqDist()
>>> y = Counter()
>>> set(dir(x)).difference(set(dir(y)))
set(['plot', 'hapaxes', '_cumulative_frequencies', 'r_Nr', 'pprint', 'N', 'unicode_repr', 'B', 'tabulate', 'pformat', 'max', 'Nr', 'freq', '__unicode__'])
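
Several of these extras are thin conveniences over the raw counts. As an illustration only, the sketch below computes rough equivalents with a plain Counter; the sample sentence and variable names are made up for this example, and the correspondences noted in the comments describe what each FreqDist method computes as I understand it, not how NLTK implements it.

```python
from collections import Counter

# Made-up sample text, split on whitespace instead of using word_tokenize.
tokens = "the cat sat on the mat and the dog sat too".split()
counts = Counter(tokens)

n_total = sum(counts.values())          # comparable to FreqDist.N(): total tokens
n_bins = len(counts)                    # comparable to FreqDist.B(): distinct words
freq_the = counts["the"] / n_total      # comparable to FreqDist.freq("the")
hapaxes = [w for w, c in counts.items() if c == 1]  # comparable to FreqDist.hapaxes()
top_word = counts.most_common(1)[0][0]  # comparable to FreqDist.max()

print(n_total, n_bins, top_word)        # 11 8 the
print(hapaxes)                          # words that occur exactly once
```

If only one or two of these quantities are needed, a plain Counter plus a line of arithmetic may be enough; plot and tabulate are where FreqDist saves real work.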

When it comes to FreqDist.most_common(), it is actually using the parent method from Counter, so the speed of retrieving the sorted most_common list is the same for both types.
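
One way to check whether a method is inherited or overridden is to walk the class's method resolution order and see which class actually defines it. The sketch below does this with a hypothetical stand-in subclass, WordDist, so that it runs without NLTK; defined_on is likewise a helper name invented for this example. With NLTK installed, the same helper can be pointed at FreqDist.

```python
from collections import Counter

class WordDist(Counter):
    """Hypothetical stand-in for a Counter subclass that adds one extra method."""

    def hapaxes(self):
        return [w for w, c in self.items() if c == 1]

def defined_on(cls, name):
    """Return the first class in cls's MRO whose own namespace defines `name`."""
    for klass in cls.__mro__:
        if name in vars(klass):
            return klass
    return None

tokens = "the cat sat on the mat the end".split()
print(defined_on(WordDist, "most_common").__name__)  # Counter
print(defined_on(WordDist, "hapaxes").__name__)      # WordDist
print(WordDist(tokens).most_common(1))               # [('the', 3)]
```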

Personally, when I just want to retrieve counts, I use collections.Counter. But when I need to do some statistical manipulation, I either use nltk.FreqDist or dump the Counter into a pandas.DataFrame (see Transform a Counter object into a Pandas DataFrame).

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34603922/
