python - Keras ImageDataGenerator validation split not selected from shuffled dataset - 6ren

Reposted by: 行者123. Updated: 2023-12-02 02:46:43

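How can I randomly split my image dataset into training and validation sets? More specifically, the validation_split argument of the Keras ImageDataGenerator does not split my images randomly into training and validation sets; it slices the validation samples from the unshuffled dataset.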

Best answer

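When you specify the validation_split argument in Keras's ImageDataGenerator, the split is performed before the data is shuffled, so only the last x samples are taken. The problem is that these last samples may not be representative of the training data, so validation can fail to reflect how the model actually performs. This is an especially common dead end when your image data is stored in one common directory with each subfolder named after a class. It has been noted in several posts: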

Choose random validation data set

As you mentioned, Keras simply takes the last x samples of the dataset, so if you want to keep using it, you need to shuffle your dataset in advance.

The training accuracy is very high, while the validation accuracy is very low?

Please check whether you shuffled the data before training. The validation split in Keras is performed before the shuffle, so you may have chosen an unbalanced dataset as your validation set, which would explain the low accuracy.

Does 'validation split' randomly choose validation sample?

The validation data is picked as the last 10% (for instance, if validation_split=0.1) of the input. The training data (the remainder) can optionally be shuffled at every epoch (shuffle argument in fit). That doesn't affect the validation data, obviously; it has to be the same set from epoch to epoch.
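The effect described in these posts can be reproduced without Keras. The following is a minimal sketch, assuming a made-up list of labels stored in class order (as happens when files are listed folder by folder). `last_fraction` is an illustrative helper, not a Keras function.

```python
import random
from collections import Counter

def last_fraction(items, fraction):
    """Return the trailing `fraction` of `items`, mimicking a split taken before shuffling."""
    n_val = int(len(items) * fraction)
    return items[len(items) - n_val:]

# Hypothetical dataset: 4 classes, 100 samples each, listed class by class.
labels = [cls for cls in ("a", "b", "c", "d") for _ in range(100)]

# Split before shuffling: the validation slice contains only the last class.
unshuffled_val = Counter(last_fraction(labels, 0.2))

# Shuffle first (fixed seed for reproducibility), then split.
shuffled = labels[:]
random.Random(1337).shuffle(shuffled)
shuffled_val = Counter(last_fraction(shuffled, 0.2))

print(unshuffled_val)  # only class "d" appears
print(shuffled_val)    # a mix of classes is expected
```

With the class-ordered list, a model would be validated on a single class only, which is one way to end up with high training accuracy and very low validation accuracy.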

This answer points to sklearn's train_test_split() as a solution, but I would like to propose a different one that stays consistent with the Keras workflow.

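With the split-folders package you can randomly split your main data directory into training, validation, and test (or just training and validation) directories. The class-specific subfolders are copied automatically.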

The input folder should have the following format:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

The package then produces this:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/            # optional
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...
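To make the transformation concrete, here is a rough standard-library sketch of a per-class random split into train/ and val/ folders. It only illustrates the idea and is not the split-folders implementation. The function name `split_by_ratio` and its parameters are hypothetical.

```python
import random
import shutil
from pathlib import Path

def split_by_ratio(input_dir, output_dir, train_ratio=0.8, seed=1337):
    """Copy each class folder's files into output/train and output/val at random."""
    rng = random.Random(seed)
    counts = {}
    for class_dir in sorted(p for p in Path(input_dir).iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)  # shuffle within the class before cutting
        n_train = int(len(files) * train_ratio)
        parts = {"train": files[:n_train], "val": files[n_train:]}
        for split_name, split_files in parts.items():
            # Recreate the class subfolder under each split directory.
            target = Path(output_dir) / split_name / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            for f in split_files:
                shutil.copy2(f, target / f.name)
            counts[(split_name, class_dir.name)] = len(split_files)
    return counts
```

Because the shuffle happens inside each class folder, every class ends up in both splits in roughly the requested proportion, which is the property the plain validation_split argument does not guarantee.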

From the documentation:

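# Note: newer releases of the package name the module `splitfolders`;
# if this import fails, try `import splitfolders` instead.
import split_folders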

# Split with a ratio.
# To only split into training and validation set, set a tuple to `ratio`, i.e, `(.8, .2)`.
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1)) # default values

# Split val/test with a fixed number of items e.g. 100 for each set.
# To only split into training and validation set, use a single number to `fixed`, i.e., `10`.
split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False) # default values
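After running either call, it can be worth checking that every class is present in every split. Below is a small standard-library sketch that assumes the output layout shown earlier. `count_files` is a hypothetical helper, not part of split-folders.

```python
from pathlib import Path

def count_files(output_dir):
    """Return {split_name: {class_name: number_of_files}} for an output/ tree."""
    summary = {}
    for split_dir in sorted(p for p in Path(output_dir).iterdir() if p.is_dir()):
        summary[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir())
        }
    return summary
```

For example, `print(count_files("output"))` lists the number of images per class under train/, val/, and (if present) test/, which makes an empty or badly unbalanced class easy to spot.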

With this new folder arrangement, you can easily use the Keras data generators to feed your training and validation data, and finally train your model.

import tensorflow as tf
import split_folders
import os

main_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/Data'
output_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/output'

split_folders.ratio(main_dir, output=output_dir, seed=1337, ratio=(.7, .3))

# Scale 8-bit pixel values from [0, 255] to [0, 1].
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255)

train_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'train'),
                                                    class_mode='categorical',
                                                    batch_size=32,
                                                    target_size=(224,224),
                                                    shuffle=True)

validation_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'val'),
                                                        target_size=(224, 224),
                                                        batch_size=32,
                                                        class_mode='categorical',
                                                        shuffle=True) # set as validation data

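# Input shape matches target_size above, with 3 channels (the default color_mode is 'rgb').
IMG_SHAPE = (224, 224, 3)

base_model = tf.keras.applications.ResNet50V2(
    input_shape=IMG_SHAPE,
    include_top=False,
    weights=None)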

maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
prediction_layer = tf.keras.layers.Dense(4, activation='softmax')

model = tf.keras.Sequential([
    base_model,
    maxpool_layer,
    prediction_layer
])

# `lr` is a deprecated alias; current Keras versions expect `learning_rate`.
opt = tf.keras.optimizers.Adam(learning_rate=0.004)
model.compile(optimizer=opt,
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

model.fit(
    train_generator,
    steps_per_epoch = train_generator.samples // 32,
    validation_data = validation_generator,
    validation_steps = validation_generator.samples // 32,
    epochs = 20)
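
Note that the integer division used for the step counts drops the final partial batch whenever the number of samples is not a multiple of the batch size. If every image should be seen once per epoch, the step count can be rounded up instead. This is a small sketch with a hypothetical helper `steps_for` and made-up sample counts:

```python
import math

def steps_for(num_samples, batch_size):
    """Number of batches needed to cover all samples, counting a final partial batch."""
    return math.ceil(num_samples / batch_size)

print(1000 // 32)           # 31: the last 8 samples are skipped
print(steps_for(1000, 32))  # 32
print(steps_for(960, 32))   # 30: no partial batch, same as 960 // 32
```

Whether the partial batch is worth including is a judgment call. Rounding down keeps every batch the same size, while rounding up uses all of the data.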