python - Keras fit_generator with a pandas iterator object

Reposted. Author: 太空宇宙. Updated: 2023-11-03 11:19:00

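My csv is too large to read into memory at once, so I want to split it into chunks and fit a Keras model with it chunk by chunk. I think I am misunderstanding how the fit_generator function works, because I keep getting a StopIteration error even though chunksize * steps_per_epoch correctly accounts for the number of rows in my csv.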

Code:

import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

np.random.seed(26)
x_train_generator = pd.read_csv('X_train.csv', header=None, chunksize=150000)
y_train_generator = pd.read_csv('Y_train.csv', header=None, chunksize=150000)
x_test_generator = pd.read_csv('X_test.csv', header=None, chunksize=50000)
y_test_generator = pd.read_csv('Y_test.csv', header=None, chunksize=50000)

model = Sequential()
model.add(Dense(500, input_dim=1132, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', metrics=['accuracy'],
              optimizer='adam')

model.fit_generator((x_train_generator.get_chunk().as_matrix(),
                     y_train_generator.get_chunk().as_matrix()),
                    steps_per_epoch=37,
                    epochs=1,
                    verbose=2,
                    validation_data=(x_test_generator.get_chunk().as_matrix(),
                                     y_test_generator.get_chunk().as_matrix()),
                    validation_steps=37
                    )

Error output:

Exception in thread Thread-107:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/user/myenv/local/lib/python2.7/site-packages/keras/utils/data_utils.py", line 568, in data_generator_task
generator_output = next(self._generator)
TypeError: tuple object is not an iterator

---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
/home/user/tmp_keras.py in <module>()
22 verbose=2,
23 validation_data=(x_test_generator.get_chunk().as_matrix(), y_test_generator.get_chunk().as_matrix()),
---> 24 validation_steps=37
25 )
26

/home/user/myenv/local/lib/python2.7/site-packages/keras/legacy/interfaces.pyc in wrapper(*args, **kwargs)
85 warnings.warn('Update your `' + object_name +
86 '` call to the Keras 2 API: ' + signature, stacklevel=2)
---> 87 return func(*args, **kwargs)
88 wrapper._original_function = func
89 return wrapper

/home/user/myenv/local/lib/python2.7/site-packages/keras/models.pyc in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, initial_epoch)
1119 workers=workers,
1120 use_multiprocessing=use_multiprocessing,
-> 1121 initial_epoch=initial_epoch)
1122
1123 @interfaces.legacy_generator_methods_support

/home/user/myenv/local/lib/python2.7/site-packages/keras/legacy/interfaces.pyc in wrapper(*args, **kwargs)
85 warnings.warn('Update your `' + object_name +
86 '` call to the Keras 2 API: ' + signature, stacklevel=2)
---> 87 return func(*args, **kwargs)
88 wrapper._original_function = func
89 return wrapper

/home/user/myenv/local/lib/python2.7/site-packages/keras/engine/training.pyc in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
2009 batch_index = 0
2010 while steps_done < steps_per_epoch:
-> 2011 generator_output = next(output_generator)
2012
2013 if not hasattr(generator_output, '__len__'):

StopIteration:

Strangely, if I wrap fit_generator() in a while 1: try: ... except StopIteration:, it runs successfully.

I have tried passing x/y_train_generator to fit_generator without the get_chunk().as_matrix() calls, but that fails because I am not passing Keras a numpy array.

Best answer

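As mentioned in the comments, the problem is how the pandas readers are used. pd.read_csv(..., chunksize=...) returns an iterator, and each .get_chunk() call pulls a single DataFrame from it, on which .as_matrix() is then called. fit_generator therefore receives one fixed tuple of arrays instead of something that produces a fresh chunk on every step, and that is not what you want to happen.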
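The traceback shows Keras calling next(self._generator) and failing with "TypeError: tuple object is not an iterator", so the first argument has to be something that yields a new (X, y) pair on each call. A minimal sketch of such a generator follows. The name chunk_batches is hypothetical, and .values is used in place of .as_matrix() on the assumption that a newer pandas may be installed (as_matrix() was removed in pandas 1.0).

```python
import numpy as np
import pandas as pd


def chunk_batches(x_path, y_path, chunksize):
    """Yield (X, y) NumPy batches forever, reopening the CSVs after each pass.

    Hypothetical helper for illustration; it is not part of Keras or pandas.
    """
    while True:  # Keras keeps calling next() across epochs, so never stop
        x_reader = pd.read_csv(x_path, header=None, chunksize=chunksize)
        y_reader = pd.read_csv(y_path, header=None, chunksize=chunksize)
        # Iterating a chunked reader yields one DataFrame per chunk
        for x_chunk, y_chunk in zip(x_reader, y_reader):
            # .values gives a NumPy array on both old and new pandas versions
            yield x_chunk.values, y_chunk.values


# Possible usage with the model from the question (not executed here):
# model.fit_generator(chunk_batches('X_train.csv', 'Y_train.csv', 150000),
#                     steps_per_epoch=37, epochs=1, verbose=2)
```

Because the generator restarts from the top of the files when they run out, it would not raise StopIteration even if steps_per_epoch overshoots the number of chunks.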

To restructure your code, you need a loop, and you need to update your model inside that loop. I have two suggestions for you:

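  1. (Simplest) Restructure the program above: loop over each chunk from pandas as a DataFrame before calling .as_matrix() on it. That way you actually get a concrete DataFrame for your X_train, y_train, X_test, and y_test data rather than an IO iterator. You can then update the trained model with each new chunk of data. (If you already have a trained model and you call .fit() again, it updates the existing model.)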

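  2. (Using Keras functionality instead of pandas functionality) Take advantage of the built-in Keras utilities for reading large datasets. Specifically, a Keras utility called HDF5Matrix (see the Keras documentation) reads data from an HDF5 file in chunks, and that data is treated transparently as a NumPy array. Like this: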

    from keras.utils import HDF5Matrix

    def load_data(path_to_data, start_ix, n_samples):
        """
        This works for loading testing or training data.
        This assumes input data have been named "inputs",
        output data have been named "outputs" in HDF5 file,
        and that you are grabbing n_samples from the file.
        """
        # Read only rows [start_ix, start_ix + n_samples) of each dataset
        X = HDF5Matrix(path_to_data, 'inputs', start_ix, start_ix + n_samples)
        y = HDF5Matrix(path_to_data, 'outputs', start_ix, start_ix + n_samples)
        return (X, y)

    # The paths, start indices, and sample counts below are placeholders
    # that the caller must define.
    X_train, y_train = load_data(path_to_training_h5, train_start_ix, n_training_samples)
    X_test, y_test = load_data(path_to_testing_h5, testing_start_ix, n_testing_samples)

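As with solution #1, this would be structured inside an overarching for loop that updates start_ix, n_samples, and the model on each iteration. For another illustration of how to use HDF5Matrix, see this example from GitHub user @jfsantos.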
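The loop described in suggestion #1 might look like the sketch below. It assumes a compiled Keras model whose fit() accepts NumPy arrays and the Keras 2 epochs argument, as in the question. The name fit_in_chunks is hypothetical, and .values stands in for .as_matrix(), which was removed in pandas 1.0.

```python
import pandas as pd


def fit_in_chunks(model, x_path, y_path, chunksize, epochs=1):
    """Update `model` one CSV chunk at a time; return the number of chunks seen.

    Hypothetical helper for illustration; `model` is any object with a
    Keras-style fit(X, y, ...) method.
    """
    n_chunks = 0
    for _ in range(epochs):
        # Re-create the readers each epoch: a reader is exhausted after one pass
        x_reader = pd.read_csv(x_path, header=None, chunksize=chunksize)
        y_reader = pd.read_csv(y_path, header=None, chunksize=chunksize)
        for x_chunk, y_chunk in zip(x_reader, y_reader):
            # Each chunk is a concrete DataFrame; .values converts it to NumPy
            model.fit(x_chunk.values, y_chunk.values, epochs=1, verbose=0)
            n_chunks += 1
    return n_chunks
```

With the question's files this could be called as fit_in_chunks(model, 'X_train.csv', 'Y_train.csv', 150000). Only one chunk of each file is held in memory at a time, at the cost of re-parsing the CSVs on every epoch.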

Regarding python - Keras fit_generator with a pandas iterator object, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46638219/
