
python - Why is the perplexity with a padded vocabulary infinite for an nltk.lm bigram model?


I am testing the perplexity measure for a text language model:

import nltk

train_sentences = nltk.sent_tokenize(train_text)
test_sentences = nltk.sent_tokenize(test_text)

train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                        for sent in train_sentences]

test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE, Laplace
from nltk.lm import Vocabulary

vocab = Vocabulary(nltk.tokenize.word_tokenize(train_text), unk_cutoff=1)

n = 2
print(train_tokenized_text)
print(len(train_tokenized_text))
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)

# print(list(vocab), "\n >>>>", list(padded_vocab))
model = MLE(n)  # train a bigram (n=2) maximum likelihood estimation model
# model.fit(train_data, padded_vocab)
model.fit(train_data, vocab)

sentences = test_sentences
print("len: ", len(sentences))
print("per all", model.perplexity(test_text))

When I call model.fit(train_data, vocab) with vocab, print("per all", model.perplexity(test_text)) prints a number (30.2), but if I use padded_vocab, which contains the extra <s> and </s> tokens, it prints inf.

Best Answer

The input to perplexity is text as ngrams, not a list of strings. You can verify this by running:

for x in test_text:
    print([((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in x])

You should see that the tokens (ngrams) are all wrong.
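For context, here is a minimal, self-contained sketch (not from the original answer; the toy corpus is made up) of the shape perplexity expects: pad each tokenized sentence and turn it into ngram tuples before scoring.

from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import bigrams

# hypothetical toy corpus, already tokenized
train_tokens = [['an', 'apple'], ['an', 'orange']]

n = 2
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokens)
model = MLE(n)
model.fit(train_data, padded_vocab)

# perplexity wants an iterable of ngram tuples, not a raw string
test_bigrams = list(bigrams(pad_both_ends(['an', 'apple'], n=2)))
print(test_bigrams)                    # [('<s>', 'an'), ('an', 'apple'), ('apple', '</s>')]
print(model.perplexity(test_bigrams))  # finite, since every bigram occurs in the training data

Passing a raw string instead makes perplexity iterate over individual characters, which are typically not in a word-level vocabulary, so their probability is zero and the perplexity comes out infinite.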

You would still get infinite perplexity if words in the test data are out of the vocabulary of the training data:

import nltk

train_sentences = nltk.sent_tokenize(train_text)
test_sentences = nltk.sent_tokenize(test_text)

train_sentences = ['an apple', 'an orange']
test_sentences = ['an apple']

train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                        for sent in train_sentences]

test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE, Laplace
from nltk.lm import Vocabulary

n = 1
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
model = MLE(n)
# fit on the padded vocab so the model knows the tokens added by padding (<s>, </s>, <UNK>, etc.)
model.fit(train_data, padded_vocab)

test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all", model.perplexity(test))

# out-of-vocabulary test data
test_sentences = ['an ant']
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]
test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all [oov]", model.perplexity(test))

Regarding "python - Why is the perplexity with a padded vocabulary infinite for an nltk.lm bigram model?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54999684/
