python - Generating n-grams (unigrams, bigrams, etc.) and their frequencies from a large set of .txt files

Reposted by: IT老高. Updated: 2023-10-28 21:08:29


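I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written the code that feeds my files into the program.

The input is 300 .txt files written in English, and I want the output as n-grams, specifically with their frequency counts.

I know NLTK has bigram and trigram modules: http://www.nltk.org/_modules/nltk/model/ngram.html

but I am not advanced enough to work them into my program.

Input: the txt files are not single sentences.

Example output: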

Bigram: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]

Trigram: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]

My code so far:

from nltk.corpus import PlaintextCorpusReader

corpus = 'C:/Users/jack3/My folder'
files = PlaintextCorpusReader(corpus, '.*')
ngrams = 2

def generate(file, ngrams):
    # For now this only prints the intended output file name
    for gram in range(0, ngrams):
        print((file[0:-4] + "_" + str(ngrams) + "_grams.txt").replace("/", "_"))


for file in files.fileids():
    generate(file, ngrams)

Any help on what the next step should be?

Best answer

Just use nltk.ngrams.

import nltk
from nltk.util import ngrams
from collections import Counter

# word_tokenize needs NLTK's tokenizer data to be installed (see nltk.download)
text = ("I need to write a program in NLTK that breaks a corpus (a large collection of "
        "txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. "
        "I need to write a program in NLTK that breaks a corpus")
token = nltk.word_tokenize(text)
# ngrams() returns an iterator, so each result can be consumed only once
bigrams = ngrams(token, 2)
trigrams = ngrams(token, 3)
fourgrams = ngrams(token, 4)
fivegrams = ngrams(token, 5)

print(Counter(bigrams))

Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2,
 ('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2,
 ('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2,
 ('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1,
 ('unigrams', ','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1,
 ('trigrams', ','): 1, (',', 'bigrams'): 1, ('large', 'collection'): 1,
 ('bigrams', ','): 1, ('of', 'txt'): 1, (')', 'into'): 1,
 ('fourgrams', 'and'): 1, ('fivegrams', '.'): 1, ('(', 'a'): 1,
 (',', 'fourgrams'): 1, ('a', 'large'): 1, ('.', 'I'): 1,
 ('collection', 'of'): 1, ('files', ')'): 1})
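
The answer builds each order (bigrams, trigrams, and so on) with a separate call. As a rough sketch of how all five orders could be counted in one go, the snippet below uses only the standard library. The helper names `sliding_ngrams` and `count_all_orders` are hypothetical, the helper only mimics the unpadded behaviour of `nltk.util.ngrams`, and `str.split` is a simplified stand-in for `nltk.word_tokenize`, so punctuation attached to words would not be split off.

```python
from collections import Counter

def sliding_ngrams(tokens, n):
    """Yield n-grams as tuples (hypothetical stand-in for nltk.util.ngrams, no padding)."""
    tokens = list(tokens)
    # zip stops at the shortest slice, giving len(tokens) - n + 1 tuples
    return zip(*(tokens[i:] for i in range(n)))

def count_all_orders(tokens, max_n=5):
    """Return a dict mapping n to a Counter of n-gram tuples, for n = 1..max_n."""
    tokens = list(tokens)
    return {n: Counter(sliding_ngrams(tokens, n)) for n in range(1, max_n + 1)}

# Whitespace tokenization is an assumption made to keep the sketch dependency-free
tokens = "Hi How are you ? i am fine and you".split()
counts = count_all_orders(tokens)
print(counts[2].most_common(3))
```

Keeping one `Counter` per order in a dict avoids five near-identical variables and makes it easy to change the maximum order later.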

Update (pure Python):

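import os
from collections import Counter

import nltk
from nltk.util import ngrams

corpus = []
path = '.'
# next(os.walk(path)) gives (dirpath, dirnames, filenames) for the top-level directory only
for i in next(os.walk(path))[2]:
    if i.endswith('.txt'):
        with open(os.path.join(path, i)) as f:
            corpus.append(f.read())

# Accumulate bigram counts across all files
frequencies = Counter()
for text in corpus:
    token = nltk.word_tokenize(text)
    bigrams = ngrams(token, 2)
    frequencies += Counter(bigrams)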
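
The loop above first loads every file into a list and then counts. A possible variation, sketched here under a few assumptions, processes one file at a time and writes the totals to disk, which is closer to the per-order output files the question hints at. The function names, the UTF-8 encoding, the tab-separated output format and the default `str.split` tokenizer are all illustrative choices, not part of the original answer; `nltk.word_tokenize` could be passed as `tokenize` instead.

```python
import tempfile
from collections import Counter
from pathlib import Path

def ngrams_from(tokens, n):
    """Sliding window of n-gram tuples over a token list (no padding)."""
    return zip(*(tokens[i:] for i in range(n)))

def count_dir(path, n, tokenize=str.split):
    """Count n-grams over every .txt file directly inside `path`, one file at a time."""
    totals = Counter()
    for txt in sorted(Path(path).glob("*.txt")):
        tokens = tokenize(txt.read_text(encoding="utf-8"))  # encoding is an assumption
        totals.update(ngrams_from(tokens, n))  # update() adds counts in place
    return totals

def save_counts(counts, out_file):
    """Write one 'count<TAB>n-gram' line per entry, most frequent first."""
    with open(out_file, "w", encoding="utf-8") as out:
        for gram, freq in counts.most_common():
            out.write(f"{freq}\t{' '.join(gram)}\n")

# Demo on a throwaway directory with two tiny files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("how are you", encoding="utf-8")
    Path(tmp, "b.txt").write_text("how are they", encoding="utf-8")
    demo = count_dir(tmp, 2)
    save_counts(demo, Path(tmp, "all_2_grams.tsv"))
    saved = Path(tmp, "all_2_grams.tsv").read_text(encoding="utf-8")
```

Because n-grams are built per file, no n-gram spans the boundary between two documents, which matches the behaviour of the answer's loop.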

For "python - Generating n-grams (unigrams, bigrams, etc.) and their frequencies from a large set of .txt files", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32441605/
