Python - Trigram probability distribution smoothing (Kneser-Ney) in NLTK returns zero

Reposted. Author: 行者123. Updated: 2023-12-01 07:35:16

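I computed the frequency distribution of my trigrams and then trained a Kneser-Ney model on it. When I check kneser_ney.prob for a trigram that is not in list_of_trigrams, I get zero! What am I doing wrong?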

freq_dist = nltk.FreqDist(list_of_trigrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

The (n-1)-gram is even in the list. This is the trigram I want:

print(kneser_ney.prob(('ئامادەكاری', 'بۆ', 'تاقیكردنەوە')))

And this is what is in my list:

('ئامادەكاری', 'بۆ', 'كارە')

I searched online for anyone with the same problem but found nothing...

Best answer

I think what you are observing is perfectly normal.

From the Wikipedia page on Kneser-Ney smoothing (Method section):

Please note that p_KN is a proper distribution, as the values defined in above way are non-negative and sum to one.

The probability is 0 when the ngram does not occur in the corpus.

Quoting from the answer you cite:

This is the whole point of smoothing, to reallocate some probability mass from the ngrams appearing in the corpus to those that don't so that you don't end up with a bunch of 0 probability ngrams.

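The sentence above does not mean that, with Kneser-Ney smoothing, any ngram you pick will have a non-zero probability. It means that, given a corpus, probability is assigned to the existing ngrams in such a way that some spare probability is left over for other ngrams in later analyses. This spare probability is something you have to assign to non-occurring ngrams yourself; it is not inherent to Kneser-Ney smoothing.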
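To make the idea of spare probability concrete, here is a minimal sketch using only the standard library. It is not NLTK's implementation: the toy counts, the discount of 0.75, the `unseen_words` list, and the uniform split of the reserved mass are all assumptions chosen for illustration.

```python
from collections import Counter

# Toy counts of the words seen after one fixed two-word context (hypothetical data).
counts = Counter({"that": 3, "it": 1})
discount = 0.75  # assumed discount value
total = sum(counts.values())

# Absolute discounting: every seen continuation gives up `discount` counts.
seen_prob = {w: (c - discount) / total for w, c in counts.items()}

# Mass freed by discounting; nothing hands it to unseen words automatically.
reserved = discount * len(counts) / total

# One simple (assumed) choice: split the reserved mass evenly over a
# hypothetical list of words never seen after this context.
unseen_words = ["nothing", "everything", "all"]
unseen_prob = {w: reserved / len(unseen_words) for w in unseen_words}

print(seen_prob)
print(reserved)
print(unseen_prob)
```

Until a step like the last one is applied, a lookup for any word outside `counts` has no probability of its own, which is the zero reported in the question.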


EDIT

For completeness, here is the code that reproduces the observed behavior (mostly taken from here and adapted to Python 3):

import nltk
from nltk.corpus import gutenberg
from nltk.util import ngrams

# Fetch the corpus and the tokenizer data it relies on.
nltk.download('gutenberg')
nltk.download('punkt')

# Build padded trigrams sentence by sentence.
gut_ngrams = tuple(
    ngram for sent in gutenberg.sents()
    for ngram in ngrams(
        sent, 3, pad_left=True, pad_right=True,
        right_pad_symbol='EOS', left_pad_symbol="BOS"))
freq_dist = nltk.FreqDist(gut_ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

# Sum the probabilities of all seen trigrams that start with ("I", "confess").
prob_sum = 0
for i in kneser_ney.samples():
    if i[0] == "I" and i[1] == "confess":
        prob_sum += kneser_ney.prob(i)
        print("{0}:{1}".format(i, kneser_ney.prob(i)))
print(prob_sum)
# ('I', 'confess', ','):0.26973684210526316
# ('I', 'confess', 'that'):0.16447368421052633
# ('I', 'confess', '.--'):0.006578947368421052
# ('I', 'confess', 'it'):0.03289473684210526
# ('I', 'confess', 'I'):0.16447368421052633
# ('I', 'confess', ',"'):0.03289473684210526
# ('I', 'confess', ';'):0.006578947368421052
# ('I', 'confess', 'myself'):0.006578947368421052
# ('I', 'confess', 'is'):0.006578947368421052
# ('I', 'confess', 'also'):0.006578947368421052
# ('I', 'confess', 'unto'):0.006578947368421052
# ('I', 'confess', '"--'):0.006578947368421052
# ('I', 'confess', 'what'):0.006578947368421052
# ('I', 'confess', 'there'):0.006578947368421052
# 0.7236842105263156

# trigram not appearing in corpus
print(kneser_ney.prob(('I', 'confess', 'nothing')))
# 0.0
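
The printed values are consistent with a discount of 0.75 (to my knowledge the default of nltk.KneserNeyProbDist) applied to a context that occurs 38 times. The counts below are inferred from the printed probabilities, not read from the corpus, so treat this as a sanity check under that assumption:

```python
import math

# Counts inferred from the listing above (assumption), in the order printed:
# e.g. 0.2697... = (11 - 0.75) / 38 and 0.00657... = (1 - 0.75) / 38.
inferred_counts = [11, 7, 1, 2, 7, 2, 1, 1, 1, 1, 1, 1, 1, 1]
discount = 0.75  # assumed default discount
context_total = sum(inferred_counts)

# Probability mass kept by the trigrams that were actually seen.
seen_mass = sum((c - discount) / context_total for c in inferred_counts)

# Mass removed by discounting and held back for unseen continuations.
reserved_mass = discount * len(inferred_counts) / context_total

print(context_total)
print(seen_mass)
print(reserved_mass)
```

Under these assumptions the seen continuations sum to about 0.724, matching the printed total, and about 0.276 is held back. None of that reserve is handed to ('I', 'confess', 'nothing') in the run above, hence the 0.0.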

The original question, "Python - Trigram probability distribution smoothing (Kneser-Ney) in NLTK returns zero", can be found on Stack Overflow: https://stackoverflow.com/questions/57017064/
