
python - How do I properly use get_keras_embedding() in Gensim's Word2Vec?

Reposted. Author: 太空狗. Updated: 2023-10-30 00:01:07

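I'm trying to build a translation network using embeddings and an RNN. I trained a Gensim Word2Vec model, and it learns word associations well. However, I can't figure out how to add the layer to a Keras model correctly. (Nor how to do a "reverse embedding" of the output, but that is a separate question that has already been answered: by default, you can't.)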

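In Word2Vec, when you pass in a string, e.g. model['hello'], you get the vector representation of that word. However, I believe the keras.layers.Embedding layer returned by Word2Vec's get_keras_embedding() takes one-hot/tokenized input rather than string input. The documentation doesn't explain what the appropriate input is, and I can't figure out how to obtain one-hot/tokenized vectors for the vocabulary that correspond 1-to-1 with the embedding layer's input.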
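To make the expected input concrete, here is a framework-free sketch of what an embedding layer does: it is a lookup table indexed by integers, so strings have to be mapped to row indices first. The tiny vocabulary, the 3-dimensional vectors, and the helper name `embed` are made up for illustration; they are not taken from Gensim or Keras.

```python
# Toy stand-ins (assumptions): a 3-word vocabulary with 3-dimensional vectors.
word_to_index = {"hello": 0, "world": 1, "again": 2}
weights = [
    [0.1, 0.2, 0.3],  # row 0 -> "hello"
    [0.4, 0.5, 0.6],  # row 1 -> "world"
    [0.7, 0.8, 0.9],  # row 2 -> "again"
]


def embed(indices):
    """Return one weight row per integer index, like an embedding lookup."""
    return [weights[i] for i in indices]


sentence = ["hello", "again"]
indices = [word_to_index[w] for w in sentence]  # strings -> integers
vectors = embed(indices)                         # integers -> vectors
print(indices)
print(vectors)
```

The layer itself only ever sees `indices`; the string-to-integer step happens before the data reaches the network.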

More details below:

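My current workaround is to apply the embedding outside Keras before feeding the data to the network. Is there any downside to doing so? I set the embedding to non-trainable anyway. So far I've noticed that memory use is extremely inefficient (around 50GB for a collection of 64-word-long sentences, even before declaring the Keras model), because the padded inputs and the weights have to be loaded outside the model. Perhaps a generator could help.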
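One way to avoid holding every embedded sentence in memory at once is to embed a single batch at a time. The following is a minimal sketch of such a generator, assuming a hypothetical `lookup` dict from word to vector in place of the Word2Vec model. It loops forever because Keras-style training generators are expected to, but the code itself does not depend on Keras.

```python
def batch_generator(sentences, lookup, batch_size):
    """Yield lists of embedded sentences, one batch at a time, forever."""
    while True:  # training generators are expected to loop indefinitely
        for start in range(0, len(sentences), batch_size):
            batch = sentences[start:start + batch_size]
            # Embed only the current batch, so memory scales with batch_size
            yield [[lookup[word] for word in sentence] for sentence in batch]


# Toy data (assumption): 2-dimensional vectors for a 2-word vocabulary.
lookup = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
sentences = [["a", "b"], ["b", "b"], ["a", "a"]]

gen = batch_generator(sentences, lookup, batch_size=2)
first = next(gen)
second = next(gen)
third = next(gen)  # wraps around to the start of the data
```

A real version would also yield the matching target batch and would probably convert each batch to a NumPy array before handing it to the model.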

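My code is below. Inputs are padded to 64 words long. The Word2Vec embedding has 300 dimensions. There are probably many mistakes here from repeated experiments trying to get the embedding to work. Suggestions are welcome.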

import gensim
word2vec_model = gensim.models.Word2Vec.load("word2vec.model")
from keras.models import Sequential
from keras.layers import Embedding, GRU, Input, Flatten, Dense, TimeDistributed, Activation, PReLU, RepeatVector, Bidirectional, Dropout
from keras.optimizers import Adam, Adadelta
from keras.callbacks import ModelCheckpoint
from keras.losses import sparse_categorical_crossentropy, mean_squared_error, cosine_proximity

keras_model = Sequential()
keras_model.add(word2vec_model.get_keras_embedding(train_embeddings=False))
keras_model.add(Bidirectional(GRU(300, return_sequences=True, dropout=0.1, recurrent_dropout=0.1, activation='tanh')))
keras_model.add(TimeDistributed(Dense(600, activation='tanh')))
# keras_model.add(PReLU())
# ^ For some reason I get error when I add Activation ‘outside’:
# int() argument must be a string, a bytes-like object or a number, not 'NoneType'
# But keras_model.add(Activation('relu')) works.
keras_model.add(Dense(source_arr.shape[1] * source_arr.shape[2]))
# size = max-output-sentence-length * embedding-dimensions to learn the embedding vector and find the nearest word in word2vec_model.similar_by_vector() afterwards.
# Alternatively one can use Dense(vocab_size) and train the network to output one-hot categorical words instead.
# Remember to change Keras loss to sparse_categorical_crossentropy.
# But this won’t benefit from Word2Vec.

keras_model.compile(loss=mean_squared_error,
                    optimizer=Adadelta(),
                    metrics=['mean_absolute_error'])
keras_model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_19 (Embedding)     (None, None, 300)         8219700
_________________________________________________________________
bidirectional_17 (Bidirectio (None, None, 600)         1081800
_________________________________________________________________
activation_4 (Activation)    (None, None, 600)         0
_________________________________________________________________
time_distributed_17 (TimeDis (None, None, 600)         360600
_________________________________________________________________
dense_24 (Dense)             (None, None, 19200)       11539200
=================================================================
Total params: 21,201,300
Trainable params: 12,981,600
Non-trainable params: 8,219,700
_________________________________________________________________
filepath="best-weights.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_mean_absolute_error', verbose=1, save_best_only=True, mode='auto')
callbacks_list = [checkpoint]
keras_model.fit(array_of_word_lists, array_of_word_lists_AFTER_being_transformed_by_word2vec, epochs=100, batch_size=2000, shuffle=True, callbacks=callbacks_list, validation_split=0.2)

An error is thrown when I try to fit the model with text:

Train on 8000 samples, validate on 2000 samples
Epoch 1/100

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-32-865f8b75fbc3> in <module>()
2 checkpoint = ModelCheckpoint(filepath, monitor='val_mean_absolute_error', verbose=1, save_best_only=True, mode='auto')
3 callbacks_list = [checkpoint]
----> 4 keras_model.fit(array_of_word_lists, array_of_word_lists_AFTER_being_transformed_by_word2vec, epochs=100, batch_size=2000, shuffle=True, callbacks=callbacks_list, validation_split=0.2)

~/virtualenv/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
1040 initial_epoch=initial_epoch,
1041 steps_per_epoch=steps_per_epoch,
-> 1042 validation_steps=validation_steps)
1043
1044 def evaluate(self, x=None, y=None,

~/virtualenv/lib/python3.6/site-packages/keras/engine/training_arrays.py in fit_loop(model, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
197 ins_batch[i] = ins_batch[i].toarray()
198
--> 199 outs = f(ins_batch)
200 if not isinstance(outs, list):
201 outs = [outs]

~/virtualenv/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
2659 return self._legacy_call(inputs)
2660
-> 2661 return self._call(inputs)
2662 else:
2663 if py_any(is_tensor(x) for x in inputs):

~/virtualenv/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in _call(self, inputs)
2612 array_vals.append(
2613 np.asarray(value,
-> 2614 dtype=tensor.dtype.base_dtype.name))
2615 if self.feed_dict:
2616 for key in sorted(self.feed_dict.keys()):

~/virtualenv/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
490
491 """
--> 492 return array(a, dtype, copy=False, order=order)
493
494

ValueError: could not convert string to float: 'hello'
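The last frames point to the cause: the input array is being cast to a numeric dtype, and a raw word cannot be parsed as a number. The same message can be reproduced with plain Python's `float()`, used here only as a stand-in for the NumPy conversion in the traceback. The helper name `try_numeric` is invented for this sketch.

```python
def try_numeric(value):
    """Return the float value, or the error message if conversion fails."""
    try:
        return float(value)
    except ValueError as err:
        return str(err)


message = try_numeric("hello")  # a raw word, as in the failing fit() call
number = try_numeric("42")      # a numeric string converts fine
print(message)
print(number)
```

So the model has to be fed integers (word indices), not the words themselves.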

Below is an excerpt from Rajmak that demonstrates how to use a tokenizer to convert words into input for a Keras embedding.

tokenizer = Tokenizer(num_words=MAX_NB_WORDS) 
tokenizer.fit_on_texts(all_texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
……
indices = np.arange(data.shape[0]) # get sequence of row index
np.random.shuffle(indices) # shuffle the row indexes
data = data[indices] # shuffle data/product-titles/x-axis
……
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-nb_validation_samples]
……
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)

Keras embedding layer can be obtained by Gensim Word2Vec’s word2vec.get_keras_embedding(train_embeddings=False) method or constructed like shown below. The null word embeddings indicate the number of words not found in our pre-trained vectors (In this case Google News). This could possibly be unique words for brands in this context.

from keras.layers import Embedding
word_index = tokenizer.word_index
nb_words = min(MAX_NB_WORDS, len(word_index))+1

embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))

embedding_layer = Embedding(embedding_matrix.shape[0],  # or len(word_index) + 1
                            embedding_matrix.shape[1],  # or EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Flatten
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation

model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(Conv1D(300, 3, padding='valid',activation='relu',strides=2))
model.add(Conv1D(150, 3, padding='valid',activation='relu',strides=2))
model.add(Conv1D(75, 3, padding='valid',activation='relu',strides=2))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(150,activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(3,activation='sigmoid'))

model.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])

model.summary()

model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2, batch_size=128)
score = model.evaluate(x_val, y_val, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Here the embedding_layer is created explicitly with:

for word, i in word_index.items():
    if word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)

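However, if we use get_keras_embedding(), the embedding matrix is already built and fixed. I don't know how each word_index in the Tokenizer can be forced to match the corresponding word in the input of the Keras embedding from get_keras_embedding().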
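The concern can be illustrated without either library. In the sketch below, two hypothetical vocabularies assign different integers to the same words: one numbered from 1, in the style of the Keras Tokenizer's word_index, and one numbered from 0 in the order of the rows of a pre-built embedding matrix. Feeding indices from the first into a matrix built for the second silently returns vectors for the wrong words. All names and values are invented for illustration.

```python
# Rows of a pre-built embedding matrix, in the embedding model's own order.
matrix_words = ["the", "cat", "sat"]
matrix = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
matrix_index = {word: i for i, word in enumerate(matrix_words)}

# A separately fitted tokenizer-style index: 1-based, so every index is shifted.
tokenizer_index = {"the": 1, "cat": 2, "sat": 3}

word = "cat"
wrong_row = tokenizer_index[word]  # index taken from the wrong vocabulary
right_row = matrix_index[word]     # index that matches the matrix rows

print(matrix_words[wrong_row])  # the word whose vector would actually be used
print(matrix_words[right_row])
```

The indices fed to a fixed embedding matrix therefore have to come from the same vocabulary that produced the matrix.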

So, what is the proper way to use Word2Vec's get_keras_embedding() in Keras?

Best answer

So I found the solution. The tokenized word index can be found in word2vec_model.wv.vocab[word].index, and the reverse lookup is available through word2vec_model.wv.index2word[word_index]. get_keras_embedding() takes the former as input.

I do the conversion as follows:

import numpy

# padding_index must be defined beforehand, e.g. as the index of a padding token.
source_word_indices = []
for i in range(len(array_of_word_lists)):
    source_word_indices.append([])
    for j in range(len(array_of_word_lists[i])):
        word = array_of_word_lists[i][j]
        if word in word2vec_model.wv.vocab:
            word_index = word2vec_model.wv.vocab[word].index
            source_word_indices[i].append(word_index)
        else:
            # Do something. For example, leave it blank or replace with padding character's index.
            source_word_indices[i].append(padding_index)
source = numpy.array(source_word_indices)

Finally, keras_model.fit(source, ...
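The same conversion, together with its inverse, can be written as two small helpers. This is a sketch under stated assumptions: `word_to_index` and `index_to_word` are plain Python stand-ins for the model's vocabulary lookups (as far as I know, Gensim 4 renamed these to `wv.key_to_index` and `wv.index_to_key`), and the helper names, the `<pad>` token and the fixed length are hypothetical.

```python
def words_to_indices(sentences, word_to_index, padding_index, length):
    """Map words to integer indices, padding or truncating to `length`."""
    rows = []
    for sentence in sentences:
        # Unknown words fall back to the padding index, as in the loop above
        row = [word_to_index.get(word, padding_index) for word in sentence]
        row = row[:length] + [padding_index] * (length - len(row))
        rows.append(row)
    return rows


def indices_to_words(rows, index_to_word):
    """Inverse lookup: map each integer index back to its word."""
    return [[index_to_word[i] for i in row] for row in rows]


# Toy vocabulary (assumption): position in the list is the word's index.
index_to_word = ["<pad>", "hello", "world"]
word_to_index = {word: i for i, word in enumerate(index_to_word)}

sentences = [["hello", "world"], ["hello", "unknown", "world", "hello"]]
rows = words_to_indices(sentences, word_to_index, padding_index=0, length=3)
back = indices_to_words(rows, index_to_word)
print(rows)
print(back)
```

Because every row has the same length, the result can be turned into a rectangular integer array before it is passed to fit().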

Regarding python - How do I properly use get_keras_embedding() in Gensim's Word2Vec?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51492778/
