python - Mapping entity embeddings back to the original categorical values

Reposted. Author: 行者123. Updated: 2023-11-30 09:43:46
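I am using Keras Embedding layers to create entity embeddings, as popularized by the 3rd place entry in the Kaggle Rossmann Store Sales competition. However, I'm not sure how to map the embeddings back to the actual categorical values. Let's look at a very basic example:

In the code below, I create a dataset with two numeric features and one categorical feature.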

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from keras.models import Model
from keras.layers import Input, Dense, Concatenate, Reshape, Dropout, Embedding

# create some fake data
data, labels = make_classification(n_classes=2, class_sep=2, n_informative=2,
                                   n_redundant=0, flip_y=0, n_features=2,
                                   n_clusters_per_class=1, n_samples=100,
                                   random_state=10)

cat_col = np.random.choice(a=[0,1,2,3,4], size=100)

data = pd.DataFrame(data)
data[2] = cat_col
embed_cols = [2]

# converting data to list of lists, as the network expects to
# see the data in this format
def preproc(df):
    data_list = []

    # convert cols to list of lists
    for c in embed_cols:
        vals = np.unique(df[c])
        val_map = {}
        for i in range(len(vals)):
            # map each unique value to its integer index (0..n-1),
            # which is what the Embedding layer expects as input
            val_map[vals[i]] = i
        data_list.append(df[c].map(val_map).values)

    # the rest of the columns
    other_cols = [c for c in df.columns if (not c in embed_cols)]
    data_list.append(df[other_cols].values)
    return data_list

data = preproc(data)

The categorical column has 5 unique values:

print("Unique Values: ", np.unique(data[0]))
Out[01]: array([0, 1, 2, 3, 4])

This is then fed into a Keras model with an embedding layer:

inputs = []
embeddings = []

input_cat_col = Input(shape=(1,))
embedding = Embedding(5, 3, input_length=1, name='cat_col')(input_cat_col)
embedding = Reshape(target_shape=(3,))(embedding)
inputs.append(input_cat_col)
embeddings.append(embedding)


# add the remaining two numeric columns from the 'data array' to the network
input_numeric = Input(shape=(2,))
embedding_numeric = Dense(8)(input_numeric)
inputs.append(input_numeric)
embeddings.append(embedding_numeric)

x = Concatenate()(embeddings)
output = Dense(1, activation='sigmoid')(x)

model = Model(inputs, output)
model.compile(loss='binary_crossentropy', optimizer='adam')

history = model.fit(data, labels,
                    epochs=10,
                    batch_size=32,
                    verbose=1,
                    validation_split=0.2)

I can get the actual embeddings by retrieving the weights of the embedding layer:

embeddings = model.get_layer('cat_col').get_weights()[0]
print("Unique Values: ", np.unique(data[0]))
print("3 Dimensional Embedding: \n", embeddings)

Unique Values: [0 1 2 3 4]
3 Dimensional Embedding:
[[ 0.02749949 0.04238378 0.0080842 ]
[-0.00083209 0.01848664 0.0130044 ]
[-0.02784528 -0.00713446 -0.01167112]
[ 0.00265562 0.03886909 0.0138318 ]
[-0.01526615 0.01284053 -0.0403452 ]]

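However, I'm not sure how to map them back. Is it safe to assume that the weights are in order? For example, 0 = [ 0.02749949 0.04238378 0.0080842 ]?

Best answer

Yes. The weights of the embedding layer correspond, in order, to the words (or, in this case, categories) indexed by integer: weight row 0 of the embedding layer corresponds to the word with index 0, and so on. You can think of the embedding layer as a lookup table in which the n-th row corresponds to the n-th word (although the embedding layer is a trainable layer, not just a static lookup table).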
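To make the lookup-table view concrete, here is a minimal NumPy sketch with no Keras involved. The `weights` array is a randomly generated stand-in for what `model.get_layer('cat_col').get_weights()[0]` would return, so its values are an assumption for illustration only. It shows that looking up index i simply selects row i, which is equivalent to multiplying a one-hot vector by the weight matrix.

```python
import numpy as np

# Hypothetical stand-in for model.get_layer('cat_col').get_weights()[0]:
# 5 categories, 3 embedding dimensions.
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=(5, 3))

# Integer-encoded categorical inputs, shaped (batch, 1) like the Keras input.
x = np.array([[3], [0], [3], [4]])

# An embedding lookup is plain row indexing: output[b, t] = weights[x[b, t]].
looked_up = weights[x]                       # shape (4, 1, 3)

# Equivalent view: one-hot encode the index, then multiply by the weights.
one_hot = np.eye(weights.shape[0])[x[:, 0]]  # shape (4, 5)
via_matmul = one_hot @ weights               # shape (4, 3)

print(np.allclose(looked_up[:, 0, :], via_matmul))
```

Because the row order is fixed by the integer index, training changes the values in each row but never which row belongs to which index.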

inputs = Input(shape=(1,))
embedding = Embedding(5, 3, input_length=1, name='cat_col')(inputs)
model = Model(inputs, embedding)

x = np.array([0,1,2,3,4]).reshape(5,1)
labels = np.zeros((5,1,3))

print (model.predict(x))
print (model.get_layer('cat_col').get_weights()[0])

assert np.array_equal(model.predict(x).reshape(-1), model.get_layer('cat_col').get_weights()[0].reshape(-1))

model.predict(x):

[[[-0.01862894,  0.0021644 ,  0.04706952]],
[[-0.03891206, 0.01743075, -0.03666048]],
[[-0.01799501, 0.01427511, -0.00056203]],
[[ 0.03703432, -0.01952349, 0.04562894]],
[[-0.02806044, -0.04623617, -0.01702447]]]

model.get_layer('cat_col').get_weights()[0]

[[-0.01862894,  0.0021644 ,  0.04706952],
[-0.03891206, 0.01743075, -0.03666048],
[-0.01799501, 0.01427511, -0.00056203],
[ 0.03703432, -0.01952349, 0.04562894],
[-0.02806044, -0.04623617, -0.01702447]]
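When the raw categories are not already the integers 0..n-1, the mapping back also depends on how they were encoded. The sketch below is one possible approach, not the asker's exact pipeline: it assumes the encoding is the sorted order returned by `np.unique`, uses the embedding matrix printed in the question, and pairs it with made-up category names (`raw` and the helper `nearest_category` are hypothetical and exist only for illustration).

```python
import numpy as np

# Hypothetical raw categorical column (not already encoded as 0..n-1).
raw = np.array(["north", "east", "south", "east", "west", "north", "centre"])

# np.unique returns the sorted categories; return_inverse gives, for each
# element of raw, the index of its category in that sorted array.
categories, codes = np.unique(raw, return_inverse=True)

# Embedding matrix as printed in the question (row i <-> integer index i).
weights = np.array([
    [ 0.02749949,  0.04238378,  0.0080842 ],
    [-0.00083209,  0.01848664,  0.0130044 ],
    [-0.02784528, -0.00713446, -0.01167112],
    [ 0.00265562,  0.03886909,  0.0138318 ],
    [-0.01526615,  0.01284053, -0.0403452 ],
])

# Forward table: original category -> its embedding vector.
category_to_vector = {cat: weights[i] for i, cat in enumerate(categories)}

def nearest_category(vector):
    """Return the category whose embedding row is closest to `vector`."""
    distances = np.linalg.norm(weights - vector, axis=1)
    return categories[int(np.argmin(distances))]

print(categories.tolist())
print(nearest_category(weights[3]))
```

The important design point is to keep the `categories` array (or an equivalent value-to-index map) from preprocessing, since it is the only record of which original value each embedding row represents.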

Regarding "python - Mapping entity embeddings back to the original categorical values", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55343375/
