python - A common pitfall when using filter_extremes in gensim


Reposted by: 行者123 Last updated: 2023-12-01 08:52:10

from gensim.corpora import Dictionary

corpus = [["a", "b", "c"], ["a", "d", "e"], ["a", "f", "g"]]
dct = Dictionary(corpus)
print(dct)
dct.filter_extremes(no_below=1)
print(dct)

When I run the code above, my output is:

Dictionary(7 unique tokens: ['a', 'b', 'c', 'd', 'e']...)
Dictionary(6 unique tokens: ['b', 'c', 'd', 'e', 'f']...)

I expected that, since 'a' occurs in more than one document, it would not be removed. That is not what happens. Am I missing something?

Best answer

See the documentation of filter_extremes:

filter_extremes(no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)

Notes:    
This removes all tokens in the dictionary that are:

    1. Less frequent than no_below documents (absolute number, e.g. 5) or
    2. More frequent than no_above documents (fraction of the total corpus size, e.g. 0.3).
    3. After (1) and (2), keep only the first keep_n most frequent tokens (or keep all if keep_n=None).

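You only passed no_below=1. This means that tokens appearing in fewer than 1 document (out of 3) are removed. Under that check alone, 'a' and every other token in the corpus would be kept.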

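However, no_above is then checked against its default of 0.5, because you did not pass an explicit value for that keyword. This means that tokens appearing in more than 50% of the documents (with 3 documents, that is tokens appearing in at least 2 of them) are removed. 'a' appears in all 3 documents, and it is in fact the only token that appears in at least 2. That is why this token, and only this token, is removed from the result. (The default keep_n of 100000 means that step 3 is a no-op in your example.)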
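To make the two frequency checks concrete, here is a minimal pure-Python sketch of the rule as the documentation describes it. The helper name surviving_tokens is hypothetical, this is not gensim's own implementation, and it ignores keep_n and keep_tokens entirely.

```python
from collections import Counter


def surviving_tokens(docs, no_below=5, no_above=0.5):
    """Illustrative model of the documented no_below / no_above rule.

    Hypothetical helper, not part of gensim; keep_n is not modeled.
    """
    num_docs = len(docs)
    # Document frequency: the number of documents each token appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return sorted(
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / num_docs <= no_above
    )


corpus = [["a", "b", "c"], ["a", "d", "e"], ["a", "f", "g"]]

# Only no_below is given, so no_above silently stays at 0.5.
kept = surviving_tokens(corpus, no_below=1)
print(kept)  # ['b', 'c', 'd', 'e', 'f', 'g']
```

'a' has a document frequency of 3 out of 3 (a fraction of 1.0), which exceeds 0.5, so it is the only token dropped. Every other token appears in 1 of 3 documents and passes both checks.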

If you only want to remove tokens at the low-frequency extreme, pass an explicit no_above=1.0 to filter_extremes.

Regarding "python - A common pitfall when using filter_extremes in gensim", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53037373/
