
deep-learning - Keras Generative LSTM only predicts stop words


I built a model in Keras that uses an LSTM to predict the next word given a sequence of words. Here is my code:

# Small LSTM network to generate text for Alice in Wonderland
import re
import numpy
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

# load ascii text and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename).read()
raw_text = raw_text.lower()
print(raw_text)

# strip punctuation and create mappings between unique words and integers
raw_text = re.sub(r'[^\w\s]', '', raw_text)
raw_text = re.sub(r"[^a-z ']+", " ", raw_text)
words_unsorted = list(raw_text.split())
words = sorted(list(set(raw_text.split())))
word_to_int = dict((w, i) for i, w in enumerate(words))
int_to_word = dict((i, w) for i, w in enumerate(words))
#print(word_to_int)

n_words = len(words_unsorted)
n_vocab = len(words)
print("Total Words: ", n_words)
print("Total Vocab: ", n_vocab)

# prepare the dataset of input to output pairs encoded as integers
seq_length = 7
dataX = []
dataY = []
for i in range(0, n_words - seq_length, 1):
    seq_in = words_unsorted[i:i + seq_length]
    seq_out = words_unsorted[i + seq_length]
    #print(seq_in)
    dataX.append([word_to_int[word] for word in seq_in])
    dataY.append(word_to_int[seq_out])

n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
print(X[0])
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

# define the checkpoint
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# fit the model
model.fit(X, y, epochs=50, batch_size=128, callbacks=callbacks_list)

The problem is that when I predict the next word for a test sentence, I always get "and" as the prediction! Should I remove all the stop words, or do something else? Also, I am training it for 20 epochs.
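For reference, a greedy decode of the kind described above might look like the following sketch; the seed sentence is hypothetical, and the variables reuse the mappings and model from the code above.

# a minimal greedy next-word prediction, reusing the model and
# mappings defined above; the seed sentence is hypothetical
seed = "alice was beginning to get very tired"
pattern = [word_to_int[w] for w in seed.split()][-seq_length:]
x = numpy.reshape(pattern, (1, seq_length, 1)) / float(n_vocab)
prediction = model.predict(x, verbose=0)
# argmax always picks the highest-probability word, which tends to
# be a frequent stop word such as "and"
print(int_to_word[numpy.argmax(prediction)])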

Best answer

I'm pretty sure, given the age of this post, that you've solved your problem by now. But just in case, here are my two cents.

What you end up predicting is the most frequent word. So if you remove the stop words, you will simply predict the next most frequent word. As far as I know, there are two ways to address this.

First, you can use a loss that emphasizes the less frequent classes, or in your case, words. There is a research paper that introduces focal loss and, conveniently, a GitHub implementation of it for Keras.
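As a rough illustration, a categorical focal loss written against the Keras backend might look like the sketch below. This is a generic rendering of the loss from the paper, not the linked GitHub code, and the gamma and alpha defaults are illustrative:

import keras.backend as K

def categorical_focal_loss(gamma=2.0, alpha=0.25):
    # focal loss scales cross-entropy by (1 - p)^gamma, so easy,
    # frequent classes contribute less to the gradient
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        cross_entropy = -y_true * K.log(y_pred)
        weight = alpha * K.pow(1.0 - y_pred, gamma)
        return K.sum(weight * cross_entropy, axis=-1)
    return loss

model.compile(loss=categorical_focal_loss(), optimizer='adam')

Down-weighting well-classified examples this way keeps very frequent words like "and" from dominating the gradient.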

The other approach is to use class_weight in the fit function:

model.fit(X, y, epochs=50, batch_size=128, callbacks=callbacks_list, class_weight=class_weight)

where you set higher weights for the less frequent words, for example inversely proportional to their frequency.
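A minimal sketch of building such a class_weight dictionary from the training targets, assuming the dataY list from the question's code; inverse-frequency weighting is one common choice, not the only one:

from collections import Counter

# count how often each word index appears as a target
counts = Counter(dataY)
total = float(sum(counts.values()))
# weight each class inversely to its frequency, scaled so the
# average weight is roughly 1
class_weight = {c: total / (len(counts) * n) for c, n in counts.items()}

model.fit(X, y, epochs=50, batch_size=128, callbacks=callbacks_list, class_weight=class_weight)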

Regarding deep-learning - Keras Generative LSTM only predicts stop words, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43413812/
