I have been given a training text and a test text. What I want to do is train a language model on the training data and then compute the perplexity of the test data.
Here is my code:
import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk import word_tokenize, sent_tokenize

fileTest = open("AaronPressman.txt", "r")
with io.open('AaronPressman.txt', encoding='utf8') as fin:
    textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()

# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace

n = 1
padded_bigrams = list(pad_both_ends(word_tokenize(textTest), n=1))
trainTest = everygrams(padded_bigrams, min_len=n, max_len=n)

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = Laplace(n)
model.fit(train_data, padded_sents)
print(model.perplexity(trainTest))
When I run this code with n=1 (unigrams), I get 1068.332393940235. With n=2 (bigrams) I get 1644.3441077259993, and with trigrams I get 2552.2085752565313.
What is wrong with it?
Best Answer
The way you create the test data is wrong: the training data is lowercased, but the test data is not, and the start and end padding tokens are missing from the test data. Try this:
import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace
from nltk import word_tokenize, sent_tokenize

"""
fileTest = open("AaronPressman.txt", "r")
with io.open('AaronPressman.txt', encoding='utf8') as fin:
    textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()
"""
textTest = "This is an ant. This is a cat"
text = "This is an orange. This is a mango"

n = 2

# Tokenize and lowercase the training text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
train_data, padded_vocab = padded_everygram_pipeline(n, tokenized_text)

# Apply the same preprocessing (lowercasing, padding) to the test text.
# Use separate names so the training vocabulary is not overwritten.
tokenized_test = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(textTest)]
test_data, _ = padded_everygram_pipeline(n, tokenized_test)

model = Laplace(n)
model.fit(train_data, padded_vocab)

# Average the per-sentence perplexities over the test sentences.
s = 0
for i, test in enumerate(test_data):
    s += model.perplexity(test)

print("Perplexity: {0}".format(s / (i + 1)))
Regarding "python - NLTK perplexity measure inversion", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54989825/