
python - NLTK perplexity measure inversion

Reposted · Author: 行者123 · Updated: 2023-11-30 09:43:57

I have been given a training text and a test text. What I want to do is train a language model on the training data and then compute the perplexity of the test data.

Here is my code:

import os
import io
from nltk import word_tokenize, sent_tokenize
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.lm import Laplace

with io.open('AaronPressman.txt', encoding='utf8') as fin:
    textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()

# Tokenize the training text, lowercasing every token.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]

n = 1
# Pad the test tokens and build the test n-grams.
padded_tokens = list(pad_both_ends(word_tokenize(textTest), n=n))
trainTest = everygrams(padded_tokens, min_len=n, max_len=n)

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = Laplace(n)
model.fit(train_data, padded_sents)

print(model.perplexity(trainTest))

When I run this code with n=1 (i.e. unigrams), I get "1068.332393940235". With n=2 (bigrams) I get "1644.3441077259993", and with trigrams I get 2552.2085752565313.

What is wrong with it?
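For context on the padding step in the code above: pad_both_ends(tokens, n) surrounds the sequence with n-1 start/end symbols, so with n=1 no <s>/</s> markers are added at all. A rough pure-Python equivalent of that behaviour (my own sketch, not NLTK's implementation):

```python
def pad_both_ends(tokens, n, left="<s>", right="</s>"):
    # NLTK-style boundary padding: n-1 symbols on each side.
    # With n=1 the sequence comes back unchanged.
    return [left] * (n - 1) + list(tokens) + [right] * (n - 1)

print(pad_both_ends(["this", "is", "a", "cat"], n=1))
# ['this', 'is', 'a', 'cat']  -- no boundary markers for unigrams
print(pad_both_ends(["this", "is", "a", "cat"], n=2))
# ['<s>', 'this', 'is', 'a', 'cat', '</s>']
```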

Best answer

The way you are creating the test data is wrong (the training data is lowercased, but the test data is not, and the start/end tokens are missing from the test data). Try this:

import os
import io
from nltk import word_tokenize, sent_tokenize
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace

"""
with io.open('AaronPressman.txt', encoding='utf8') as fin:
    textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()
"""
textTest = "This is an ant. This is a cat"
text = "This is an orange. This is a mango"

n = 2
# Tokenize and lowercase the training text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

# Tokenize and lowercase the test text the same way, so it gets
# the same casing and the same <s>/</s> padding as the training data.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(textTest)]
test_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = Laplace(1)
model.fit(train_data, padded_sents)

# Average the per-sentence perplexities over the test sentences.
s = 0
for i, test in enumerate(test_data):
    p = model.perplexity(test)
    s += p

print("Perplexity: {0}".format(s / (i + 1)))
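One caveat worth noting about the answer above: padded_everygram_pipeline returns lazy generators, so train_data and test_data can each be walked only once; to score the test set a second time you must rebuild the pipeline. A minimal pure-Python illustration of that one-shot behaviour:

```python
def make_stream():
    # Stands in for the lazy iterators padded_everygram_pipeline returns.
    return (x for x in [1, 2, 3])

stream = make_stream()
print(list(stream))  # first pass consumes the generator: [1, 2, 3]
print(list(stream))  # second pass yields nothing: []
```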

Regarding python - NLTK perplexity measure inversion, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54989825/
