python - Efficiently calculate word frequency in a string


Reposted by: 太空狗 | Updated: 2023-10-29 19:31:31


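I am parsing a long string of text and counting how many times each word occurs, in Python. I have a function that works, but I am looking for advice on whether it can be made more efficient (in terms of speed), and whether there is a Python library function that could do this for me so that I am not reinventing the wheel.

Can you suggest a more efficient way to find the most common words in a long string (usually over 1000 words)?

Also, what is the best way to sort a dictionary into a list where the first element is the most common word, the second element is the second most common word, and so on?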

test = """abc def-ghi jkl abc
abc"""

def calculate_word_frequency(s):
    # Post: return a list of words ordered from the most
    # frequent to the least frequent

    words = s.split()
    freq  = {}
    for word in words:
        # dict.has_key() no longer exists in Python 3;
        # get() returns 0 for a word that has not been seen yet
        freq[word] = freq.get(word, 0) + 1
    return sort(freq)

def sort(d):
    # Post: sort dictionary d into list of words ordered
    # from highest freq to lowest freq
    # eg: For {"the": 3, "a": 9, "abc": 2} should be
    # sorted into the following list ["a","the","abc"]

    # Dicts have no sort() method. sorted() iterates over the keys,
    # the key function looks up each word's count, and reverse=True
    # puts the highest count first.
    return sorted(d, key=lambda word: d[word], reverse=True)

print(calculate_word_frequency(test))

Best answer

Use collections.Counter:

>>> from collections import Counter
>>> test = 'abc def abc def zzz zzz'
>>> Counter(test.split()).most_common()
[('abc', 2), ('def', 2), ('zzz', 2)]
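
To tie this back to the function in the question, here is a minimal sketch, assuming Python 3.7 or later. The name `words_by_frequency` and its `top_n` parameter are made up for illustration and are not part of the original answer. It returns only the words, most frequent first, and then shows that `sorted()` with a key function gives the same kind of ordering for a plain dict:

```python
from collections import Counter


def words_by_frequency(s, top_n=None):
    """Return the words of s ordered from most to least frequent.

    top_n is a hypothetical convenience parameter; None means "all words".
    """
    # Counter tallies the whitespace-separated words in a single pass
    counts = Counter(s.split())
    # most_common() yields (word, count) pairs sorted by descending count;
    # on Python 3.7+ words with equal counts keep their first-seen order
    return [word for word, _ in counts.most_common(top_n)]


test = """abc def-ghi jkl abc
abc"""

print(words_by_frequency(test))  # ['abc', 'def-ghi', 'jkl']

# The same ordering for a plain dict, without Counter
freq = {"the": 3, "a": 9, "abc": 2}
print(sorted(freq, key=freq.get, reverse=True))  # ['a', 'the', 'abc']
```

Note that `split()` only separates on whitespace, so `def-ghi` is counted as a single word. If hyphens or punctuation should break words apart, the input would need to be tokenized differently before counting.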

Regarding "python - Efficiently calculate word frequency in a string", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/9919604/


