gpt4 book ai didi

python-3.x - 在 Keras 中使用单热编码创建模型

转载 作者:行者123 更新时间:2023-12-04 21:04:35 24 4
gpt4 key购买 nike

我正在研究句子分类问题并尝试使用 Keras 解决。
词汇表中的唯一单词总数为 36。

在这种情况下,总词汇是 [W1,W2,W3....W36]

所以,如果我有一个单词为 [W1 W2 W6 W7 W9] 的句子,如果我对其进行编码,我会得到一个 numpy 数组,如下所示

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

形状是 (5,36)

我被困在这里。我已经生成了 20000 个形状各异的 numpy 数组,即 (N,36)
其中 N 是句子中的单词数。所以,我有 20,000 个用于训练的句子和 100 个用于测试的句子,所有句子都标有 (1,36) 单热编码

我有 x_train、x_test、y_train 和 y_test

x_test 和 y_test 是维度 (1,36)

任何人都可以请建议我该怎么做?

我做了一些下面的编码
model = Sequential()
model.add(Dense(512, input_shape=(??????))),
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])

任何帮助将非常感激。

更新和回应@putonspectacles

非常感谢您花费时间和精力进行详细回复。我对您的代码进行了一些小的修改,我认为需要完成这些修改才能使代码正常工作。请在下面找到它
num_classes = 5 
max_words = 20
sentences = ["The cat is in the house","The green boy","computer programs are not alive while the children are"]
labels = np.random.randint(0, num_classes, 3)
y = to_categorical(labels, num_classes=num_classes)
words = set(w for sent in sentences for w in sent.split())
word_map = {w : i+1 for (i, w) in enumerate(words)}
#-Changed the below line the inner for loop sent to sent.split()
sent_ints = [[word_map[w] for w in sent.split()] for sent in sentences]
vocab_size = len(words)
print(vocab_size)
#-changed the below line - the outer for loop sentences to sent_ints
X = np.array([to_categorical(pad_sequences((sent,), max_words),vocab_size+1) for sent in sent_ints])
print(X)
print(y)
model = Sequential()
model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))
model.add(LSTM(128))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(X,y)

如果没有这些更改,代码将无法工作。当我运行上面的代码时,我得到了如下所示的正确嵌入
[[[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]


[[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]]


[[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]]



[[0. 0. 0. 0. 1.]
[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]]

但我得到的错误是“ 检查输入时出错:预期dense_44_input 有3 个维度,但得到了形状为(3, 1, 20, 16) 的数组”

当我将输入形状更改为
模型.添加(密集(512,input_shape =(无,max_words,vocab_size + 1)))

我收到错误“ 输入 0 与 lstm_27 层不兼容:预期 ndim=3,发现 ndim=4

我正在努力解决这个问题。如果你能给我一个方向,那就太好了。

我接受了答案,因为它回答了嵌入单词的目标。再次感谢。

最佳答案

酷,你清理了问题。你想对一个句子进行分类。我假设你说我想要比词袋编码做得更好。您想重视序列。

我们将选择一个新模型然后- an RNN (the LSTM version) .该模型有效地总结了每个单词(按顺序)的重要性,因为它构建了最适合任务的句子表示。

但是我们将不得不以不同的方式处理预处理。为了效率(以便我们可以批量处理更多句子而不是一次处理单个句子),我们希望所有句子都具有相同数量的单词。所以我们选择一个 max_words,比如 20,我们填充较短的句子以达到最大单词,然后我们将长度超过 20 个单词的句子删减。

Keras 将为此提供帮助。我们将用整数对每个单词进行编码。

from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM

num_classes = 5
max_words = 20
sentences = ["The cat is in the house",
"The green boy",
"computer programs are not alive while the children are"]
labels = np.random.randint(0, num_classes, 3)
y = to_categorical(labels, num_classes=num_classes)

words = set(w for sent in sentences for w in sent.split())
word_map = {w : i+1 for (i, w) in enumerate(words)}
sent_ints = [[word_map[w] for w in sent] for sent in sentences]
vocab_size = len(words)

所以“绿色男孩”现在可能是 [1, 3, 5]。
然后我们将填充和单热编码
# pad to max_words length and encode with len(words) + 1  
# + 1 because we'll reserve 0 add the padding sentinel.
X = np.array([to_categorical(pad_sequences((sent,), max_words),
vocab_size + 1) for sent in sent_ints])
print(X.shape) # (3, 20, 16)

现在到模型:我们将添加一个 Dense层转换那些热
词到密集向量。然后我们使用 LSTM转换词向量
在句子中到一个密集的句子向量。最后,我们将使用 softmax 激活来生成类的概率分布。
model = Sequential()
model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))
model.add(LSTM(128))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])

那应该是完整的。然后你可以继续训练。
model.fit(X,y)

编辑:

这一行:
# we need to split the sentences in a words write now it reading every
# letter notice the sent.split() in the correct version below.
sent_ints = [[word_map[w] for w in sent] for sent in sentences]

应该:
sent_ints = [[word_map[w] for w in sent.split()] for sent in sentences]

关于python-3.x - 在 Keras 中使用单热编码创建模型,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/49604765/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com