
python - Error when preparing training and validation data in a neural network


The problem I am having is that as soon as I try to train the model on the data split (model.fit(...)), I get the following error:

InvalidArgumentError: indices[0] = 261429 is not in [0, 235061)
[[node recommender_net_3/embedding_15/embedding_lookup (defined at <ipython-input-46-e2a6cff5eb06>:29)]] [Op:__inference_train_function_9058]

I am using the RetailRocket dataset. You can find it here: https://www.kaggle.com/retailrocket/ecommerce-dataset .

You can see my implementation below:

import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

from tensorflow import keras
from tensorflow.keras import layers
from pathlib import Path
from google.colab import drive

drive.mount('/content/drive')
import os
for dirname, _, filenames in os.walk('/content/drive/My Drive/Dataset'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
path = '/content/drive/My Drive/Dataset/'

# items = pd.concat([pd.read_csv(path+'item_properties_part2.csv'),
# pd.read_csv(path+'item_properties_part1.csv')])
# items.shape

events = pd.read_csv(path+'events.csv')
events.shape

df_event = pd.read_csv(path+ "events.csv")
print("file events.csv")
df_event.head()
df_event['code'] = df_event['event'].map({"view": 1, "addtocart": 2, "transaction": 3})
df_event.head()


# Visitor
visitor_ids = df_event["visitorid"].unique().tolist()
visitor2visitor_encoded = {x: i for i, x in enumerate(visitor_ids)}
visitorencoded2visitor = {i: x for i, x in enumerate(visitor_ids)}

# Items
items_ids = df_event["itemid"].unique().tolist()
item2item_encoded = {x: i for i, x in enumerate(items_ids)}
item_encoded2item = {i: x for i, x in enumerate(items_ids)}


df_event["visitor"] = df_event["visitorid"].map(visitor2visitor_encoded)
df_event["item"] = df_event["itemid"].map(item2item_encoded)

num_visitors = len(visitor2visitor_encoded)
num_items = len(item_encoded2item)

event = df_event["event"].value_counts()
#min_rating = min(df["rating"])
#max_rating = max(df["rating"])

print("Number of visitors: {}, Number of items: {}".format(num_visitors, num_items))
print("Number of views: {}, Number of addtocart: {}, Number of transactions: {}".format(event[0], event[1], event[2]))

## The Error
x = df_event[["visitorid", "itemid"]].values
# Normalize the targets between 0 and 1. Makes it easy to train.
#y = (df[col] - df[col].mean())/df[col].std()
df_event['z_score'] = (df_event['code'] - df_event['code'].mean())/df_event['code'].std()
y = df_event['z_score'].values
# Assuming training on 90% of the data and validating on 10%.
train_indices = int(0.9 * df_event.shape[0])
x_train, x_val, y_train, y_val = (
    x[:train_indices],
    x[train_indices:],
    y[:train_indices],
    y[train_indices:],
)

##

EMBEDDING_SIZE = 50


class RecommenderNet(keras.Model):
    def __init__(self, num_visitors, num_items, embedding_size, **kwargs):
        super(RecommenderNet, self).__init__(**kwargs)
        self.num_visitors = num_visitors
        self.num_items = num_items
        self.embedding_size = embedding_size
        self.visitor_embedding = layers.Embedding(
            num_visitors,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.visitor_bias = layers.Embedding(num_visitors, 1)
        self.item_embedding = layers.Embedding(
            num_items,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.item_bias = layers.Embedding(num_items, 1)

    def call(self, inputs):
        visitor_vector = self.visitor_embedding(inputs[:, 0])
        visitor_bias = self.visitor_bias(inputs[:, 0])
        item_vector = self.item_embedding(inputs[:, 1])
        item_bias = self.item_bias(inputs[:, 1])
        dot_visitor_item = tf.tensordot(visitor_vector, item_vector, 2)
        # Add all the components (including bias)
        x = dot_visitor_item + visitor_bias + item_bias
        # The sigmoid activation forces the rating to between 0 and 1
        return tf.nn.sigmoid(x)


model = RecommenderNet(num_visitors, num_items, EMBEDDING_SIZE)
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(), optimizer=keras.optimizers.Adam(lr=0.001)
)

## The InvalidArgumentError

history = model.fit(
    x=x_train,
    y=y_train,
    batch_size=64,
    epochs=5,
    verbose=1,
    validation_data=(x_val, y_val),
)

However, when I change the split code as follows:

# old code
x_train, x_val, y_train, y_val = (
    x[:train_indices],
    x[train_indices:],
    y[:train_indices],
    y[train_indices:],
)

# new code
from sklearn.model_selection import train_test_split

X_temp, X_test, y_temp, y_test = train_test_split(df_event[['visitorid', 'itemid']],
                                                   df_event['code'],
                                                   test_size=0.2,
                                                   random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.2, random_state=1)

X_train.shape, X_val.shape, X_test.shape

With the new code I do not get any errors and the implementation runs without issue. Can someone tell me how to write the "old" code correctly so that it runs? What is the difference between the two? Thanks in advance!

Best Answer

According to https://github.com/tensorflow/tensorflow/issues/23698 , the error may occur because the first Embedding dimension, i.e. the vocabulary size, is too small for the NLP (or other) task, while the tokenizer/word counter that ran beforehand detected more unique words, hence the following error:

InvalidArgumentError: indices[0] = 261429 is not in [0, 235061)

This means you have 261429 distinct words/items/elements, but the embedding input dimension is set to 235061, so 261429 - 235061 = 26368 words/elements are left out.
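
A quick way to confirm which lookup is out of range is to compare the largest index in the training inputs against the vocabulary size given to each Embedding layer; a minimal sketch, assuming the x, num_visitors and num_items variables from the question:

import numpy as np

# Largest indices that the two embedding lookups will receive
max_visitor_index = int(np.max(x[:, 0]))
max_item_index = int(np.max(x[:, 1]))

# Every index must be strictly smaller than the embedding input dimension
print("visitor: max index", max_visitor_index, "vs vocabulary size", num_visitors)
print("item:    max index", max_item_index, "vs vocabulary size", num_items)

assert max_visitor_index < num_visitors, "visitor_embedding is too small"
assert max_item_index < num_items, "item_embedding is too small"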

While I assume that your second solution contains fewer unique values, which is why your code works, a possible fix (although hardcoded below) could be to increase num_items or num_visitors to 261429 (judging from the error line I would say it comes from num_items, but I am not 100% sure); please test both Embedding() layers to detect which one raises the error:

self.item_embedding = layers.Embedding(
    261429,
    embedding_size,
    embeddings_initializer="he_normal",
    embeddings_regularizer=keras.regularizers.l2(1e-6),
)
self.item_bias = layers.Embedding(261429, 1)
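
If hardcoding 261429 is undesirable, the same idea can be expressed by deriving both vocabulary sizes from the data before building the model; a minimal sketch, assuming x holds the (visitor, item) index pairs that are actually passed to model.fit:

# Derive the embedding input dimensions from the data itself instead of
# hardcoding 261429: each Embedding must cover indices 0 .. max_index.
num_visitors = int(x[:, 0].max()) + 1
num_items = int(x[:, 1].max()) + 1

model = RecommenderNet(num_visitors, num_items, EMBEDDING_SIZE)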

Regarding python - Error when preparing training and validation data in a neural network, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63987295/
