
python - Counting ngram frequencies using text collocations

Reposted. Author: 太空宇宙. Updated: 2023-11-03 21:18:51

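I want to count the frequency of the three words before and after a specific word in a text file that has been converted to tokens.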

import nltk
from collections import Counter

with open('dracula.txt', 'r', encoding="ISO-8859-1") as textfile:
    text_data = textfile.read().replace('\n', ' ').lower()

tokens = nltk.word_tokenize(text_data)
text = nltk.Text(tokens)
grams = nltk.ngrams(tokens, 4)  # every run of 4 consecutive tokens
freq = Counter(grams)
freq.most_common(20)

I don't know how to search for the string "dracula" as a filter word. I have also tried:

text.collocations(num=100)
text.concordance('dracula')

The desired output would look something like this, with counts. The three words before "dracula", sorted by count:

(('and', 'he', 'saw', 'dracula'), 4),
(('one', 'cannot', 'see', 'dracula'), 2)

The three words after "dracula", sorted by count:

(('dracula', 'and', 'he', 'saw'), 4),
(('dracula', 'one', 'cannot', 'see'), 2)

Trigrams with "dracula" in the middle, sorted by count:

(('count', 'dracula', 'saw'), 4),
(('count', 'dracula', 'cannot'), 2)

Thanks in advance for any help.

Best answer

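Once you have the frequency information in tuple format (as you already do), you can simply filter for the word you are looking for with an if condition. Here it is using Python's list comprehension syntax: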

import nltk
from collections import Counter

# pulled text from here: https://archive.org/details/draculabr00stokuoft/page/n6
with open('dracula.txt', 'r', encoding="ISO-8859-1") as textfile:
    text_data = textfile.read().replace('\n', ' ').lower()

tokens = nltk.word_tokenize(text_data)
text = nltk.Text(tokens)
grams = nltk.ngrams(tokens, 4)
freq = Counter(grams)  # Counter comes from the standard library

# most_common() yields (ngram, count) pairs sorted by count;
# item[0] is the 4-gram tuple, so item[0][3] is its last word.
dracula_last = [item for item in freq.most_common() if item[0][3] == 'dracula']
dracula_first = [item for item in freq.most_common() if item[0][0] == 'dracula']
dracula_second = [item for item in freq.most_common() if item[0][1] == 'dracula']
# etc.

This produces lists of (4-gram, count) pairs with "dracula" at different positions, each still sorted by count. dracula_last looks like this:

[(('the', 'castle', 'of', 'dracula'), 3),
(("'s", 'journal', '243', 'dracula'), 1),
(('carpathian', 'moun-', '2', 'dracula'), 1),
(('of', 'the', 'castle', 'dracula'), 1),
(('named', 'by', 'count', 'dracula'), 1),
(('disease', '.', 'count', 'dracula'), 1),
...]
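
The same filter can be written once and reused for any n-gram length and position, which also covers the trigram case from the question (n = 3 with "dracula" at index 1). The following is only a minimal sketch: the helper name ngram_counts_at and the toy token list are made up for illustration, and it builds the n-grams with plain zip instead of nltk.ngrams so that it runs without NLTK or the book text.

```python
from collections import Counter

def ngram_counts_at(tokens, n, word, position):
    """Count the n-grams of length n that have `word` at index `position`."""
    # Zipping n staggered views of the token list yields every n-gram as a tuple.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(g for g in grams if g[position] == word)

# Toy tokens standing in for the tokenized novel (hypothetical data).
tokens = "he saw count dracula and he saw count dracula again".split()

before = ngram_counts_at(tokens, 4, "dracula", 3)  # three words before
after = ngram_counts_at(tokens, 4, "dracula", 0)   # three words after
middle = ngram_counts_at(tokens, 3, "dracula", 1)  # trigram with the word in the middle

print(before.most_common())
print(after.most_common())
print(middle.most_common())
```

With real data, tokens would be the list returned by nltk.word_tokenize above. Filtering before counting keeps the Counter small, since only the n-grams that contain the target word at the requested position are stored.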

This question and answer on counting ngram frequencies using text collocations in python come from a similar question found on Stack Overflow: https://stackoverflow.com/questions/54471926/
