gpt4 book ai didi

tensorflow - Model.fit() 是否将整个训练数据集上传到 GPU?

转载 作者:行者123 更新时间:2023-12-05 03:57:22 25 4
gpt4 key购买 nike

我正在使用 keras API、tensorflow 后端在几个 GB 数据集上训练 LSTM。在某些内存中数据 (numpy) 上运行 Model.fit() 时,它会在一个请求中分配 8GB 内存,这在仅加载一小部分数据时不会发生。我的 GPU 不能同时使用模型参数和 8GB,它会耗尽内存并停止。我很确定这在我从 TF2 beta 升级到 TF2rc 后开始发生。以下是我所说的适合度:

tb = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
es = keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=patience*2, restore_best_weights=True)
lr_reduce = keras.callbacks.ReduceLROnPlateau(factor=0.1, patience=patience, verbose=1)
chkpointing = keras.callbacks.ModelCheckpoint(weight_fname, monitor='val_loss', verbose=0, save_best_only=True,
save_weights_only=True, mode='auto')

model.fit(train_data_x, train_data_y, validation_data=(test_data_x, test_data_y), batch_size=cfg['batch_size'],
epochs=nepochs, validation_freq=1, callbacks=[lr_reduce, es, tb, chkpointing],
class_weight=cfg['class_weight'], shuffle=True)

是否打算在 GPU 上为整个数据集分配空间?我怎样才能防止它发生?

编辑:

更新了代码以限制内存分配。它确实限制了它,因为它表明 TF 可以访问比以前更少的内存,但它仍然尝试分配 8.14GB。下面是我如何限制内存和选择 GPU:

def select_gpu(gpu_id=-1, max_usage=.5):  # max 2 gpu only
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id) if gpu_id != -1 else '0,1'
gpus = tf.config.experimental.list_physical_devices('GPU')
max_memory = 11534 # MB got from: grep -i --color memory /var/log/Xorg.0.log
for gpu in gpus:
print('GPU FOUND:', gpu)
tf.config.experimental.set_memory_growth(gpu, True) # FIXME true
tf.config.experimental.set_virtual_device_configuration(gpu,
[tf.config.experimental.VirtualDeviceConfiguration(memory_limit=max_memory * max_usage)])
print('RUNNING ON GPU #{}'.format(gpu_id))

# ... just call select_gpu(0) in the beginning of the script

这是错误:

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
time_distributed (TimeDistri (None, 42, 256) 7168
_________________________________________________________________
cu_dnnlstm (CuDNNLSTM) (None, 42, 256) 526336
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM) (None, 42, 256) 526336
_________________________________________________________________
cu_dnnlstm_2 (CuDNNLSTM) (None, 42, 256) 526336
_________________________________________________________________
cu_dnnlstm_3 (CuDNNLSTM) (None, 42, 256) 526336
_________________________________________________________________
cu_dnnlstm_4 (CuDNNLSTM) (None, 42, 256) 526336
_________________________________________________________________
cu_dnnlstm_5 (CuDNNLSTM) (None, 256) 526336
_________________________________________________________________
dense_1 (Dense) (None, 256) 65792
_________________________________________________________________
dense_2 (Dense) (None, 1) 257
=================================================================
Total params: 3,231,233
Trainable params: 3,231,233
Non-trainable params: 0
_________________________________________________________________
None
2019-10-27 12:36:48.833843: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 8.14GiB (rounded to 8738821888). Current allocation summary follows.
2019-10-27 12:36:48.833927: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256): Total Chunks: 16, Chunks in use: 15. 4.0KiB allocated for chunks. 3.8KiB in use in bin. 72B client-requested in use in bin.
2019-10-27 12:36:48.833944: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (512): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.833958: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1024): Total Chunks: 5, Chunks in use: 4. 5.5KiB allocated for chunks. 4.2KiB in use in bin. 4.0KiB client-requested in use in bin.
2019-10-27 12:36:48.833970: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2048): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.833982: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4096): Total Chunks: 1, Chunks in use: 0. 4.8KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.833998: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8192): Total Chunks: 6, Chunks in use: 6. 49.8KiB allocated for chunks. 49.8KiB in use in bin. 48.0KiB client-requested in use in bin.
2019-10-27 12:36:48.834012: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16384): Total Chunks: 1, Chunks in use: 1. 27.0KiB allocated for chunks. 27.0KiB in use in bin. 27.0KiB client-requested in use in bin.
2019-10-27 12:36:48.834023: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.834034: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.834045: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (131072): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.834060: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (262144): Total Chunks: 1, Chunks in use: 1. 504.0KiB allocated for chunks. 504.0KiB in use in bin. 256.0KiB client-requested in use in bin.
2019-10-27 12:36:48.834073: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (524288): Total Chunks: 1, Chunks in use: 0. 512.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.834088: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1048576): Total Chunks: 12, Chunks in use: 12. 12.00MiB allocated for chunks. 12.00MiB in use in bin. 12.00MiB client-requested in use in bin.
2019-10-27 12:36:48.834099: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2097152): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.834110: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4194304): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.834122: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8388608): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.834132: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16777216): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.834143: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (33554432): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.834156: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (67108864): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.834167: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.834180: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456): Total Chunks: 1, Chunks in use: 0. 4.49GiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-27 12:36:48.834193: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 8.14GiB was 256.00MiB, Chunk State:
2019-10-27 12:36:48.834213: I tensorflow/core/common_runtime/bfc_allocator.cc:891] Size: 4.49GiB | Requested Size: 1.00MiB | in_use: 0 | bin_num: 20, prev: Size: 1.00MiB | Requested Size: 1.00MiB | in_use: 1 | bin_num: -1
2019-10-27 12:36:48.834223: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 4837081088
2019-10-27 12:36:48.834237: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6000000 next 1 of size 256
2019-10-27 12:36:48.834247: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6000100 next 2 of size 256
2019-10-27 12:36:48.834257: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6000200 next 3 of size 1280
2019-10-27 12:36:48.834267: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6000700 next 4 of size 256
2019-10-27 12:36:48.834277: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6000800 next 5 of size 1024
2019-10-27 12:36:48.834287: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6000c00 next 8 of size 256
2019-10-27 12:36:48.834296: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6000d00 next 9 of size 256
2019-10-27 12:36:48.834306: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6000e00 next 10 of size 256
2019-10-27 12:36:48.834316: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6000f00 next 13 of size 256
2019-10-27 12:36:48.834325: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6001000 next 34 of size 256
2019-10-27 12:36:48.834335: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6001100 next 35 of size 256
2019-10-27 12:36:48.834344: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6001200 next 37 of size 256
2019-10-27 12:36:48.834354: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6001300 next 16 of size 256
2019-10-27 12:36:48.834363: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6001400 next 14 of size 256
2019-10-27 12:36:48.834373: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free at 0x7f3cf6001500 next 40 of size 1280
2019-10-27 12:36:48.834382: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6001a00 next 41 of size 1024
2019-10-27 12:36:48.834392: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free at 0x7f3cf6001e00 next 18 of size 4864
2019-10-27 12:36:48.834402: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6003100 next 19 of size 8192
2019-10-27 12:36:48.834411: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6005100 next 36 of size 1024
2019-10-27 12:36:48.834420: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6005500 next 39 of size 256
2019-10-27 12:36:48.834430: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6005600 next 42 of size 256
2019-10-27 12:36:48.834439: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6005700 next 43 of size 256
2019-10-27 12:36:48.834449: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free at 0x7f3cf6005800 next 21 of size 256
2019-10-27 12:36:48.834459: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6005900 next 22 of size 8192
2019-10-27 12:36:48.834469: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6007900 next 25 of size 8192
2019-10-27 12:36:48.834478: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6009900 next 28 of size 8192
2019-10-27 12:36:48.834488: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf600b900 next 6 of size 9984
2019-10-27 12:36:48.834500: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf600e000 next 7 of size 27648
2019-10-27 12:36:48.834509: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6014c00 next 33 of size 8192
2019-10-27 12:36:48.834519: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free at 0x7f3cf6016c00 next 38 of size 524288
2019-10-27 12:36:48.834528: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6096c00 next 17 of size 516096
2019-10-27 12:36:48.834538: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6114c00 next 12 of size 1048576
2019-10-27 12:36:48.834548: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6214c00 next 11 of size 1048576
2019-10-27 12:36:48.834558: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6314c00 next 20 of size 1048576
2019-10-27 12:36:48.834567: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6414c00 next 15 of size 1048576
2019-10-27 12:36:48.834577: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6514c00 next 24 of size 1048576
2019-10-27 12:36:48.834586: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6614c00 next 23 of size 1048576
2019-10-27 12:36:48.834595: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6714c00 next 27 of size 1048576
2019-10-27 12:36:48.834605: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6814c00 next 26 of size 1048576
2019-10-27 12:36:48.834614: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6914c00 next 30 of size 1048576
2019-10-27 12:36:48.834623: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6a14c00 next 29 of size 1048576
2019-10-27 12:36:48.834633: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6b14c00 next 32 of size 1048576
2019-10-27 12:36:48.834642: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3cf6c14c00 next 31 of size 1048576
2019-10-27 12:36:48.834652: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free at 0x7f3cf6d14c00 next 18446744073709551615 of size 4823364608
2019-10-27 12:36:48.834661: I tensorflow/core/common_runtime/bfc_allocator.cc:914] Summary of in-use Chunks by size:
2019-10-27 12:36:48.834673: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 15 Chunks of size 256 totalling 3.8KiB
2019-10-27 12:36:48.834684: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 3 Chunks of size 1024 totalling 3.0KiB
2019-10-27 12:36:48.834694: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 1280 totalling 1.2KiB
2019-10-27 12:36:48.834706: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 5 Chunks of size 8192 totalling 40.0KiB
2019-10-27 12:36:48.834715: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 9984 totalling 9.8KiB
2019-10-27 12:36:48.834726: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 27648 totalling 27.0KiB
2019-10-27 12:36:48.834736: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 516096 totalling 504.0KiB
2019-10-27 12:36:48.834747: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 12 Chunks of size 1048576 totalling 12.00MiB
2019-10-27 12:36:48.834759: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 12.57MiB
2019-10-27 12:36:48.834769: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 4837081088 memory_limit_: 4837081088 available bytes: 0 curr_region_allocation_bytes_: 9674162176
2019-10-27 12:36:48.834784: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit: 4837081088
InUse: 13185792
MaxInUse: 14756864
NumAllocs: 186
MaxAllocSize: 1048576

你可以看到我的模型很小,不需要接近 8GB 的​​内存。

编辑#2:

我刚刚恢复到 TF2 beta (tensorflow-gpu==2.0.0-beta1),问题就消失了。希望我们能找到比这更好的解决方案。

最佳答案

这是 TensorFlow 的默认行为,分配的比实际需要的多 - 虽然它可能不完全是正在分配的数据集,但您只需要模型和 TF/Keras session 中的即时张量/数据, 在 TF2 中完成:

max_memory = 8000 # dedicated memory in MB; run 'dxdiag' to get exact figure
max_usage = 0.95 * max_memory # example for using up to 95%

gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_virtual_device_configuration(
gpus[0],
[tf.config.experimental.VirtualDeviceConfiguration(memory_limit=max_usage)])

另请参阅 limiting GPU memory growth 上的 TensorFlow 文档, 和 relevant Git .


更新:TF2 eager 似乎有一个已知的内存管理问题 - 作为解决方法,禁用它在 Eager 中工作,它可以运行得更快 - 参见 details here :

tf.compat.v1.disable_eager_execution()

关于tensorflow - Model.fit() 是否将整个训练数据集上传到 GPU?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/58575279/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com