
python - Why is the perplexity with a padded vocabulary infinite for an nltk.lm bigram model?


I am testing the perplexity measure for a text language model:

import nltk

train_sentences = nltk.sent_tokenize(train_text)
test_sentences = nltk.sent_tokenize(test_text)

train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                        for sent in train_sentences]

test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE, Laplace
from nltk.lm import Vocabulary

vocab = Vocabulary(nltk.tokenize.word_tokenize(train_text), unk_cutoff=1)

n = 2
print(train_tokenized_text)
print(len(train_tokenized_text))
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)

# print(list(vocab), "\n >>>>", list(padded_vocab))
model = MLE(n)  # train a bigram (n=2) maximum likelihood estimation model
# model.fit(train_data, padded_vocab)
model.fit(train_data, vocab)

sentences = test_sentences
print("len: ", len(sentences))
print("per all", model.perplexity(test_text))

When I call model.fit(train_data, vocab) with vocab, print("per all", model.perplexity(test_text)) prints a number (30.2), but if I use padded_vocab, which contains the extra <s> and </s> tokens, it prints inf.

Best Answer

The input to perplexity is text as ngrams, not a list of strings. You can verify this by running:

for x in test_text:
    print([((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in x])

You should see that the tokens (ngrams) are all wrong.
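For context, here is a minimal, self-contained sketch (not from the original answer; the toy corpus is made up) of the shape perplexity expects: pad each tokenized sentence and turn it into ngram tuples before scoring.

from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import bigrams

# hypothetical toy corpus, already tokenized
train_tokens = [['an', 'apple'], ['an', 'orange']]

n = 2
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokens)
model = MLE(n)
model.fit(train_data, padded_vocab)

# perplexity wants an iterable of ngram tuples, not a raw string
test_bigrams = list(bigrams(pad_both_ends(['an', 'apple'], n=2)))
print(test_bigrams)                    # [('<s>', 'an'), ('an', 'apple'), ('apple', '</s>')]
print(model.perplexity(test_bigrams))  # finite, since every bigram occurs in the training data

Passing a raw string instead makes perplexity iterate over individual characters, which are typically not in a word-level vocabulary, so their probability is zero and the perplexity comes out infinite.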

You would still get infinite perplexity if words in the test data are out of the vocabulary of the training data:

import nltk

train_sentences = nltk.sent_tokenize(train_text)
test_sentences = nltk.sent_tokenize(test_text)

train_sentences = ['an apple', 'an orange']
test_sentences = ['an apple']

train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                        for sent in train_sentences]

test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE, Laplace
from nltk.lm import Vocabulary

n = 1
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
model = MLE(n)
# fit on the padded vocab so the model knows the tokens added by padding (<s>, </s>, <UNK>, etc.)
model.fit(train_data, padded_vocab)

test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all", model.perplexity(test))

# out-of-vocabulary test data
test_sentences = ['an ant']
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]
test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all [oov]", model.perplexity(test))

Regarding "python - Why is the perplexity with a padded vocabulary infinite for an nltk.lm bigram model?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54999684/
