
python - keras startup time (_make_train_function()) is very slow on a Tesla V100-SXM2-16GB GPU compared to a less powerful GPU


Follow-up to: keras with tensorflow on GPU machine - some parts are very slow

Running mnist_cnn.py from tensorflow 1.4 (slightly modified, mainly to add logging).

The runs use the prebuilt docker image tensorflow/tensorflow:1.4.0-gpu-py3.

On a p2.xlarge AWS machine (with a Tesla K80 GPU) performance is fine: the first batch, which mostly consists of the call to _make_train_function, takes about 2 seconds (see the timestamps of the 'begin batch' and 'end batch' log lines):

2017-11-19 08:26:26,172 : INFO : fit

2017-11-19 08:26:26,637 : INFO : begin batch
2017-11-19 08:26:26.638409: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-19 08:26:26.760940: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-19 08:26:26.761478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.11GiB
2017-11-19 08:26:26.761506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)

2017-11-19 08:26:28,135 : INFO : end batch
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/1
60000/60000 [==============================] - 12s - loss: 0.3526 - acc: 0.8920 - val_loss: 0.0818 - val_acc: 0.9755
Test loss: 0.081773182778
Test accuracy: 0.9755

On a p3.2xlarge machine (with a Tesla V100-SXM2-16GB GPU) the same part takes about 10 minutes:

2017-11-19 08:26:44,120 : INFO : fit

2017-11-19 08:26:44,715 : INFO : begin batch
2017-11-19 08:26:44.716680: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-19 08:26:46.108295: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-19 08:26:46.108775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2017-11-19 08:26:46.108815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)

2017-11-19 08:36:16,552 : INFO : end batch
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/1
60000/60000 [==============================] - 576s - loss: 0.3418 - acc: 0.8949 - val_loss: 0.0769 - val_acc: 0.9772
Test loss: 0.0769035610346
Test accuracy: 0.9772

The code used:

#!/usr/bin/env python
'''Trains a simple convnet on the MNIST dataset.

Gets to 99.25% test accuracy after 12 epochs
(there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.
'''

from __future__ import print_function
import cProfile
import os
from tensorflow.contrib import keras
from tensorflow.contrib.keras import backend as K
import logging


logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format='\n%(asctime)s : %(levelname)s : %(message)s')

# Callback that logs timestamps around the first two batches, so the cost of
# building the training function can be read off the log.
class callback(keras.callbacks.Callback):
    def on_batch_begin(self, batch, logs=None):
        if batch <= 1:
            logger.info('begin batch')

    def on_batch_end(self, batch, logs=None):
        if batch <= 1:
            logger.info('end batch')

batch_size = 128
num_classes = 10
epochs = 1

# input image dimensions
img_rows, img_cols = 28, 28

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = keras.models.Sequential()
model.add(keras.layers.Conv2D(32, kernel_size=(3, 3),
                              activation='relu',
                              input_shape=input_shape))
model.add(keras.layers.Conv2D(64, (3, 3), activation='relu'))
model.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(keras.layers.Dropout(0.25))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(128, activation='relu'))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
profiler = cProfile.Profile()
profiler.enable()
logger.info('fit')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test), callbacks=[callback()])
profiler.dump_stats(os.path.expanduser('~/profiler.pstats'))
score = model.evaluate(x_test, y_test, verbose=0)

print('Test loss:', score[0])
print('Test accuracy:', score[1])
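
To isolate the startup cost without running a full epoch, the first call to train_on_batch (which builds the training function via _make_train_function internally) can be timed against a subsequent call. This is a minimal sketch, not part of the original script, and assumes the model and data prepared above:

import time

# The first call builds the Keras training function (_make_train_function);
# on the V100 setup described above this is where the multi-minute delay appears.
start = time.time()
model.train_on_batch(x_train[:batch_size], y_train[:batch_size])
print('first batch:', time.time() - start, 'seconds')

# Subsequent calls reuse the already-built function and should be fast.
start = time.time()
model.train_on_batch(x_train[batch_size:2 * batch_size],
                     y_train[batch_size:2 * batch_size])
print('second batch:', time.time() - start, 'seconds')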

Best answer

Using a tensorflow build compiled with CUDA 9 seems to almost completely solve the problem: https://github.com/mind/wheels/releases/tag/tf1.4-gpu-cuda9

Using this build also requires installing the MKL library; instructions here: https://software.intel.com/en-us/articles/intel-mkl-dnn-part-1-library-overview-and-installation
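
After installing the CUDA 9 wheel (and MKL), a quick sanity check that the new build still sees the V100 can be done from Python. This is a small sketch, not from the original answer; device_lib is an internal (non-public) TensorFlow module, but it is available in 1.4:

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)                # expect a 1.4.x build
print(tf.test.is_built_with_cuda())  # expect True for a GPU build

# The GPU entry's description should mention the V100 and compute capability 7.0,
# matching the device line in the logs above.
for device in device_lib.list_local_devices():
    print(device.name, device.physical_device_desc)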

An explanation of why this happens, or a solution that does not involve a modified tensorflow build, would still be preferred.

Regarding python - keras startup time (_make_train_function()) being very slow on a Tesla V100-SXM2-16GB GPU compared to a less powerful GPU, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47375416/
