
python - How to get the probability of bigrams in a text of sentences?

Reposted. Author: 太空狗. Updated: 2023-10-30 02:51:38

I have a text that contains many sentences. How can I use nltk.ngrams to process it?

This is my code:

import nltk
from nltk.util import ngrams

# `raw` is the full text as a single string.
sequence = nltk.tokenize.word_tokenize(raw)
bigram = ngrams(sequence, 2)
freq_dist = nltk.FreqDist(bigram)
prob_dist = nltk.MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()

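However, the code above treats all the sentences as one sequence. The sentences are separate, and I assume the last word of one sentence is unrelated to the first word of the next. How can I create bigrams for such a text? I also need prob_dist and number_of_bigrams, both of which are based on freq_dist.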

There are similar questions, such as "What are ngram counts and how to implement using nltk?", but they are mostly about a single sequence of words.

Best answer

You can use the new nltk.lm module. Here is an example. First, get some data and tokenize it:

import os
import requests
import io #codecs

from nltk import word_tokenize, sent_tokenize

# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt', encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)

# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]

Then the language modelling:

# Preprocess the tokenized text for 3-grams language modelling
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = MLE(n)  # Let's train a 3-gram maximum likelihood estimation model.
model.fit(train_data, padded_sents)
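The reason this handles separate sentences is that padded_everygram_pipeline pads every sentence on its own before extracting n-grams, so no n-gram spans a sentence boundary. A minimal sketch of that padding step on one made-up sentence (the toy sentence and the variable names are assumptions for illustration, not part of the original answer):

```python
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import bigrams

# A single, already-tokenized toy sentence (made up for illustration).
sent = ["language", "is", "never", "random"]

# For bigrams (n=2), one <s> is added at the start and one </s> at the end.
padded = list(pad_both_ends(sent, n=2))

# Bigrams are taken within this padded sentence only, so the last word
# pairs with </s> rather than with the first word of the next sentence.
sent_bigrams = list(bigrams(padded))

print(padded)
print(sent_bigrams)
```

With n = 3, as in the answer's code, each sentence should receive two padding symbols on each side instead of one.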

To get the counts:

model.counts['language'] # i.e. Count('language')
model.counts[['language']]['is'] # i.e. Count('is'|'language')
model.counts[['language', 'is']]['never'] # i.e. Count('never'|'language is')

To get the probabilities:

model.score('is', 'language'.split())  # P('is'|'language')
model.score('never', 'language is'.split()) # P('never'|'language is')
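Since the question is about bigrams specifically, the same steps can be run with n = 2. The sketch below uses two made-up, pre-tokenized sentences so that it runs without downloading anything; toy_sents, bigram_model and the other variable names are assumptions for illustration. Note that score returns a conditional probability P(word | context), not the joint probability of the pair.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy, already-tokenized sentences (made up for illustration).
toy_sents = [
    ["language", "is", "never", "random"],
    ["language", "is", "fun"],
]

n = 2  # bigram model
train_data, padded_vocab = padded_everygram_pipeline(n, toy_sents)

bigram_model = MLE(n)
bigram_model.fit(train_data, padded_vocab)

# Conditional probabilities P(word | previous word).
p_is = bigram_model.score("is", ["language"])
p_never = bigram_model.score("never", ["is"])

# "random" ends one sentence and "language" starts the next; because each
# sentence was padded separately, this cross-sentence pair was never counted.
p_cross = bigram_model.score("language", ["random"])

# Total number of bigram tokens seen in training, padding bigrams included.
number_of_bigrams = bigram_model.counts[2].N()

print(p_is, p_never, p_cross, number_of_bigrams)
```

On this toy data "language" is always followed by "is", and "is" is followed by "never" in one of its two occurrences, so the first two scores come out as 1.0 and 0.5. The cross-sentence score is 0.0.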

There are some kinks on the Kaggle platform when loading the notebook, but at some point this notebook should give a good overview of the nltk.lm module: https://www.kaggle.com/alvations/n-gram-language-model-with-nltk
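If you would rather keep the freq_dist, prob_dist and number_of_bigrams variables from the question, one possible alternative (not part of the original answer) is to build the bigrams sentence by sentence and chain them into a single FreqDist. This is a sketch on made-up, pre-tokenized sentences. It uses no padding, and MLEProbDist here estimates the joint relative frequency of each bigram rather than a conditional probability.

```python
from itertools import chain

from nltk.probability import FreqDist, MLEProbDist
from nltk.util import bigrams

# Toy, already-tokenized sentences (made up for illustration). With real
# text, each inner list would come from word_tokenize on one sentence.
sentences = [
    ["language", "is", "never", "random"],
    ["language", "is", "fun"],
]

# Extract bigrams inside each sentence, then chain the results together,
# so no bigram pairs the end of one sentence with the start of the next.
sentence_bigrams = chain.from_iterable(bigrams(sent) for sent in sentences)

freq_dist = FreqDist(sentence_bigrams)
prob_dist = MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()

print(number_of_bigrams)
print(freq_dist[("language", "is")])
print(prob_dist.prob(("language", "is")))
```

Here ("random", "language") never enters freq_dist, which is the behaviour the question asks for.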

Regarding "python - How to get the probability of bigrams in a text of sentences?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54962539/
