
python - Keras difference between generators and Sequence


I am using a deep CNN+LSTM network to classify a dataset of 1D signals. I am using Keras 2.2.4 backed by TensorFlow 1.12.0. Because I have a large dataset and limited resources, I use a generator to load the data into memory during the training phase. First, I tried this generator:

import random

def data_generator(batch_size, preproc, type, x, y):
    num_examples = len(x)
    examples = zip(x, y)
    # sort by signal length so the samples inside a batch have similar lengths
    examples = sorted(examples, key=lambda x: x[0].shape[0])
    end = num_examples - batch_size + 1
    batches = [examples[i:i + batch_size] for i in range(0, end, batch_size)]

    random.shuffle(batches)
    while True:
        for batch in batches:
            x, y = zip(*batch)
            yield preproc.process(x, y)
With the above approach, I can start training with mini-batches of up to 30 samples at a time. However, this approach does not guarantee that the network trains on each sample exactly once per epoch. Given this note from the Keras documentation:

Sequence is a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch which is not the case with generators.
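To make that note concrete, here is a minimal, self-contained sketch (the toy data, model, and loaders below are illustrative, not from my project) of how the two kinds of loaders are typically handed to fit_generator in Keras 2.2.x: a plain generator is normally consumed by a single worker, while a Sequence can safely be read by several workers because Keras requests batches by index.

    import numpy as np
    import keras
    from keras.utils import Sequence

    # toy data: 100 samples with 16 features each, binary labels
    x_all = np.random.rand(100, 16).astype('float32')
    y_all = np.random.randint(0, 2, size=(100, 1))

    model = keras.models.Sequential([
        keras.layers.Dense(8, activation='relu', input_shape=(16,)),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')

    def plain_generator(batch_size=10):
        # Keras cannot index a generator, so with several workers a sample
        # may be seen more or fewer times than once per epoch
        while True:
            idx = np.random.randint(0, len(x_all), batch_size)
            yield x_all[idx], y_all[idx]

    class IndexedSequence(Sequence):
        # Keras asks for batch i via __getitem__, so every sample is used
        # exactly once per epoch even with multiprocessing enabled
        def __init__(self, batch_size=10):
            self.batch_size = batch_size

        def __len__(self):
            return int(np.ceil(len(x_all) / self.batch_size))

        def __getitem__(self, i):
            sl = slice(i * self.batch_size, (i + 1) * self.batch_size)
            return x_all[sl], y_all[sl]

    model.fit_generator(plain_generator(), steps_per_epoch=10, epochs=1,
                        workers=1, use_multiprocessing=False)
    model.fit_generator(IndexedSequence(), epochs=1,
                        workers=2, use_multiprocessing=True)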


I tried another way of loading the data, using the following class:
import numpy as np
from keras.utils import Sequence

class Data_Gen(Sequence):

    def __init__(self, batch_size, preproc, type, x_set, y_set):
        self.x, self.y = np.array(x_set), np.array(y_set)
        self.batch_size = batch_size
        self.indices = np.arange(self.x.shape[0])
        np.random.shuffle(self.indices)
        self.type = type
        self.preproc = preproc

    def __len__(self):
        # print(self.type + ' - len : ' + str(int(np.ceil(self.x.shape[0] / self.batch_size))))
        return int(np.ceil(self.x.shape[0] / self.batch_size))

    def __getitem__(self, idx):
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = self.x[inds]
        batch_y = self.y[inds]
        return self.preproc.process(batch_x, batch_y)

    def on_epoch_end(self):
        np.random.shuffle(self.indices)
I can confirm that with this approach the network trains on each sample exactly once per epoch, but this time I get an out-of-memory error as soon as I put more than 7 samples in a mini-batch:

OP_REQUIRES failed at random_op.cc: 202: Resource exhausted: OOM when allocating tensor with shape...............
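(As a rough way to reason about such errors, here is an illustrative back-of-the-envelope calculation, not taken from my actual data: a dense float32 batch occupies the product of its dimensions times 4 bytes, so a mini-batch's footprint depends on the padded signal length just as much as on the number of samples in it.)

    import numpy as np

    def batch_megabytes(shape, dtype=np.float32):
        # dense tensor footprint = number of elements * bytes per element
        return int(np.prod(shape)) * np.dtype(dtype).itemsize / 2**20

    # hypothetical (batch, timesteps, channels) shapes for 1D signal batches
    print(batch_megabytes((30, 3000, 1)))   # 30 short signals  -> ~0.34 MB
    print(batch_megabytes((7, 60000, 1)))   # 7 longer signals  -> ~1.6 MB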


I can confirm that I am using the same model architecture, configuration, and machine for both tests. I am wondering why there is such a difference between these two ways of loading the data.
Please feel free to ask for more details if needed.
Thanks in advance.
EDIT:
Here is the code I am using to fit the model:
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    factor=0.1,
    patience=2,
    min_lr=params["learning_rate"])

checkpointer = keras.callbacks.ModelCheckpoint(
    filepath=str(get_filename_for_saving(save_dir)),
    save_best_only=False)

batch_size = params.get("batch_size", 32)

path = './logs/run-{0}'.format(datetime.now().strftime("%b %d %Y %H:%M:%S"))
tensorboard = keras.callbacks.TensorBoard(log_dir=path, histogram_freq=0,
                                          write_graph=True, write_images=False)
if index == 0:
    print(model.summary())
    print("Model memory needed for batchsize {0} : {1} Gb".format(batch_size, get_model_memory_usage(batch_size, model)))

if params.get("generator", False):
    train_gen = load.data_generator(batch_size, preproc, 'Train', *train)
    dev_gen = load.data_generator(batch_size, preproc, 'Dev', *dev)
    valid_metrics = Metrics(dev_gen, len(dev[0]) // batch_size, batch_size)
    model.fit_generator(
        train_gen,
        steps_per_epoch=len(train[0]) // batch_size + 1 if len(train[0]) % batch_size != 0 else len(train[0]) // batch_size,
        epochs=MAX_EPOCHS,
        validation_data=dev_gen,
        validation_steps=len(dev[0]) // batch_size + 1 if len(dev[0]) % batch_size != 0 else len(dev[0]) // batch_size,
        callbacks=[valid_metrics, MyCallback(), checkpointer, reduce_lr, tensorboard])

# train_gen = load.Data_Gen(batch_size, preproc, 'Train', *train)
# dev_gen = load.Data_Gen(batch_size, preproc, 'Dev', *dev)
# model.fit_generator(
#     train_gen,
#     epochs=MAX_EPOCHS,
#     validation_data=dev_gen,
#     callbacks=[valid_metrics, MyCallback(), checkpointer, reduce_lr, tensorboard])

Best Answer

Those methods are roughly the same. Subclassing Sequence is the right choice when your dataset does not fit in memory. But you shouldn't run any preprocessing inside the class's methods, because that work would be re-executed once per epoch and waste a lot of computing resources.
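To make that concrete, here is a minimal sketch (the heavy_transform helper and the array shapes are made up for illustration) of paying for the expensive work once, in __init__, so that __getitem__ only slices ready-made arrays each epoch; this of course assumes the preprocessed arrays still fit in memory:

    import numpy as np
    from keras.utils import Sequence

    def heavy_transform(x):
        # stand-in for an expensive per-sample preprocessing step
        return (x - x.mean()) / (x.std() + 1e-8)

    class PreprocessedGen(Sequence):
        def __init__(self, x_set, y_set, batch_size):
            # pay the preprocessing cost exactly once, at construction time ...
            self.x = np.stack([heavy_transform(x) for x in x_set])
            self.y = np.array(y_set)
            self.batch_size = batch_size

        def __len__(self):
            return int(np.ceil(len(self.x) / self.batch_size))

        def __getitem__(self, i):
            # ... so every epoch only pays for cheap array slicing here
            sl = slice(i * self.batch_size, (i + 1) * self.batch_size)
            return self.x[sl], self.y[sl]

        def on_epoch_end(self):
            # shuffling between epochs is cheap and fine to keep here
            idx = np.random.permutation(len(self.x))
            self.x, self.y = self.x[idx], self.y[idx]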

It is probably also easier to shuffle the samples themselves rather than their indices. Like this:

import numpy as np
from random import shuffle
from keras.utils import Sequence

class DataGen(Sequence):
    def __init__(self, batch_size, preproc, type, x_set, y_set):
        # keep (x, y) pairs together so a single shuffle keeps them aligned
        self.samples = list(zip(x_set, y_set))
        self.batch_size = batch_size
        shuffle(self.samples)
        self.type = type
        self.preproc = preproc

    def __len__(self):
        return int(np.ceil(len(self.samples) / self.batch_size))

    def __getitem__(self, i):
        batch = self.samples[i * self.batch_size:(i + 1) * self.batch_size]
        # unzip the pairs back into an x tuple and a y tuple
        return self.preproc.process(*zip(*batch))

    def on_epoch_end(self):
        shuffle(self.samples)

I don't think it's possible to say why you run out of memory without knowing more about your data. My guess is that your preproc function is doing something wrong. You can debug it by running:

for e in DataGen(batch_size, preproc, 'Train', *train):
    print(e)
for e in DataGen(batch_size, preproc, 'Dev', *dev):
    print(e)

Most likely, you will run out of memory there.
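Building on that, here is a slightly more targeted debugging loop (the inspect_batches helper is mine; it assumes preproc.process returns an (x_batch, y_batch) pair of array-likes) that prints the shape and memory footprint of every batch each loader yields, which makes an unexpectedly large batch easy to spot:

    import numpy as np

    def inspect_batches(loader, max_batches=50):
        # print shape and size of each (x, y) batch the loader produces
        for i, (bx, by) in enumerate(loader):
            bx, by = np.asarray(bx), np.asarray(by)
            print(i, bx.shape, by.shape, round(bx.nbytes / 2**20, 2), 'MiB')
            if i + 1 >= max_batches:   # the plain generator never stops on its own
                break

    inspect_batches(DataGen(batch_size, preproc, 'Train', *train))
    inspect_batches(data_generator(batch_size, preproc, 'Train', *train))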

Regarding "python - Keras difference between generators and Sequence", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56460901/
