
tensorflow - Understanding ResourceExhaustedError: OOM when allocating tensor with shape

Reposted. Author: 行者123. Updated: 2023-12-03 13:27:26

I am trying to implement the skip-thought model using tensorflow; the current version is available here.

Currently I use one GPU of my machine (it has 2 GPUs in total), and the GPU info is

2017-09-06 11:29:32.657299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.683
pciBusID 0000:02:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB

However, I get an OOM when I try to feed data to the model. I tried to debug it as follows:

Right after running sess.run(tf.global_variables_initializer()), I use the following snippet
logger.info('Total: {} params'.format(
    np.sum([
        np.prod(v.get_shape().as_list())
        for v in tf.trainable_variables()
    ])))

and get 2017-09-06 11:29:51,333 INFO main main.py:127 - Total: 62968629 params, which is roughly 240 MB if everything uses tf.float32. The output of tf.global_variables() is
[<tf.Variable 'embedding/embedding_matrix:0' shape=(155229, 200) dtype=float32_ref>,
<tf.Variable 'encoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>,
<tf.Variable 'encoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>,
<tf.Variable 'encoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>,
<tf.Variable 'encoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>,
<tf.Variable 'decoder/weights:0' shape=(200, 155229) dtype=float32_ref>,
<tf.Variable 'decoder/biases:0' shape=(155229,) dtype=float32_ref>,
<tf.Variable 'decoder/previous_decoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>,
<tf.Variable 'decoder/previous_decoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>,
<tf.Variable 'decoder/previous_decoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>,
<tf.Variable 'decoder/previous_decoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>,
<tf.Variable 'decoder/next_decoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>,
<tf.Variable 'decoder/next_decoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>,
<tf.Variable 'decoder/next_decoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>,
<tf.Variable 'decoder/next_decoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>,
<tf.Variable 'global_step:0' shape=() dtype=int32_ref>]
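As a sanity check, the reported parameter count can be reproduced from the shapes listed above without TensorFlow. The sketch below is plain Python; the shape list is copied by hand from that output, global_step is left out on the assumption that it is not trainable, and 4 bytes per value is assumed for float32.

```python
from math import prod

# Shapes copied by hand from the tf.global_variables() output above.
# global_step is omitted: it is assumed to be non-trainable.
gru = [(400, 400), (400,), (400, 200), (200,)]  # gates kernel/bias, candidate kernel/bias
shapes = (
    [(155229, 200)]                 # embedding/embedding_matrix
    + gru                           # encoder GRU cell
    + [(200, 155229), (155229,)]    # decoder/weights, decoder/biases
    + gru                           # decoder/previous_decoder GRU cell
    + gru                           # decoder/next_decoder GRU cell
)

total_params = sum(prod(shape) for shape in shapes)
total_mib = total_params * 4 / 2**20  # float32 assumed: 4 bytes per value

print(total_params)         # 62968629
print(round(total_mib, 1))  # 240.2
```

The total matches the logged figure, and almost all of it sits in the embedding matrix and the decoder weights, so the variables themselves are far from filling the card.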

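In my training phase, I have a data array of shape (164652, 3, 30), i.e. sample_size x 3 x time_step, where the 3 stands for the previous, current, and next sentence. This training data is about 57 MB and is stored in a loader. I then wrote a generator function to fetch the sentences, which looks like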
def iter_batches(self, batch_size=128, time_major=True, shuffle=True):

    num_samples = len(self._sentences)
    if shuffle:
        samples = self._sentences[np.random.permutation(num_samples)]
    else:
        samples = self._sentences

    batch_start = 0
    while batch_start < num_samples:
        batch = samples[batch_start:batch_start + batch_size]

        lens = (batch != self._vocab[self._vocab.pad_token]).sum(axis=2)
        y, x, z = batch[:, 0, :], batch[:, 1, :], batch[:, 2, :]
        if time_major:
            yield (y.T, lens[:, 0]), (x.T, lens[:, 1]), (z.T, lens[:, 2])
        else:
            yield (y, lens[:, 0]), (x, lens[:, 1]), (z, lens[:, 2])
        batch_start += batch_size
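To get a feel for how small each feed is, here is a self-contained sketch of the same batching logic run on synthetic data. PAD_ID, the random array, and the module-level function form are assumptions made for illustration; they are not the asker's loader or vocabulary.

```python
import numpy as np

PAD_ID = 0  # hypothetical pad id; the original looks it up via self._vocab


def iter_batches(sentences, batch_size=128, time_major=True, shuffle=True):
    """Yield ((y, y_lens), (x, x_lens), (z, z_lens)) from an (N, 3, T) int array."""
    num_samples = len(sentences)
    if shuffle:
        sentences = sentences[np.random.permutation(num_samples)]
    for start in range(0, num_samples, batch_size):
        batch = sentences[start:start + batch_size]
        lens = (batch != PAD_ID).sum(axis=2)  # non-pad tokens, shape (batch, 3)
        y, x, z = batch[:, 0, :], batch[:, 1, :], batch[:, 2, :]
        if time_major:
            y, x, z = y.T, x.T, z.T  # (time_step, batch)
        yield (y, lens[:, 0]), (x, lens[:, 1]), (z, lens[:, 2])


# Synthetic stand-in data: 300 samples, 3 sentences each, 30 time steps.
rng = np.random.default_rng(0)
data = rng.integers(1, 100, size=(300, 3, 30), dtype=np.int32)
data[:, :, 20:] = PAD_ID  # pad the tail so every sentence has length 20

(y, y_lens), (x, x_lens), (z, z_lens) = next(iter_batches(data))
feed_bytes = sum(a.nbytes for a in (y, x, z, y_lens, x_lens, z_lens))
print(x.shape, x_lens[:3], feed_bytes)
```

One batch of 128 sentences with 30 steps comes to a few tens of kilobytes, so the values passed through feed_dict cannot explain an OOM on their own. The memory goes to the tensors the graph computes from them.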

The training loop looks like
for epoch in range(num_epochs):
    batches = loader.iter_batches(batch_size=args.batch_size)
    try:
        (y, y_lens), (x, x_lens), (z, z_lens) = next(batches)
        _, summaries, loss_val = sess.run(
            [train_op, train_summary_op, st.loss],
            feed_dict={
                st.inputs: x,
                st.sequence_length: x_lens,
                st.previous_targets: y,
                st.previous_target_lengths: y_lens,
                st.next_targets: z,
                st.next_target_lengths: z_lens
            })
    except StopIteration:
        ...

Then I get an OOM. If I comment out the whole try body (so no data is fed), the script runs fine.

I don't understand why I get an OOM with data this small. nvidia-smi always shows
Wed Sep  6 12:03:37 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.59 Driver Version: 384.59 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 0% 44C P2 60W / 275W | 10623MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:03:00.0 Off | N/A |
| 0% 43C P2 62W / 275W | 10621MiB / 11171MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 32748 C python3 10613MiB |
| 1 32748 C python3 10611MiB |
+-----------------------------------------------------------------------------+

I can't see the script's actual GPU usage, because tensorflow always grabs all the memory at startup. The real problem here is that I don't know how to debug this.

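I have read a few posts about OOM on StackOverflow. Most of them occur when a large test set is fed to the model, and feeding the data in smaller batches avoids the problem. But I don't understand why such a small combination of data and parameters is a problem on my 11 GB 1080 Ti, since the error is only about trying to allocate a matrix of size [3840 x 155229]. (This is the decoder's output matrix: 3840 = 30 (time_steps) x 128 (batch_size), and 155229 is vocab_size.)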
2017-09-06 12:14:45.787566: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ********************************************************************************************xxxxxxxx
2017-09-06 12:14:45.787597: W tensorflow/core/framework/op_kernel.cc:1158] Resource exhausted: OOM when allocating tensor with shape[3840,155229]
2017-09-06 12:14:45.788735: W tensorflow/core/framework/op_kernel.cc:1158] Resource exhausted: OOM when allocating tensor with shape[3840,155229]
[[Node: decoder/previous_decoder/Add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](decoder/previous_decoder/MatMul, decoder/biases/read)]]
2017-09-06 12:14:45.790453: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 2857 get requests, put_count=2078 evicted_count=1000 eviction_rate=0.481232 and unsatisfied allocation rate=0.657683
2017-09-06 12:14:45.790482: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1139, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
status, run_metadata)
File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3840,155229]
[[Node: decoder/previous_decoder/Add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](decoder/previous_decoder/MatMul, decoder/biases/read)]]
[[Node: GradientDescent/update/_146 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2166_GradientDescent/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Any help will be appreciated. Thanks in advance.

Best answer

Let's break the issues down one by one:

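Regarding tensorflow allocating all memory up front: you can use the following snippet to make tensorflow allocate memory only as it is needed. That way you can see how usage develops.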

gpu_options = tf.GPUOptions(allow_growth=True)
session = tf.InteractiveSession(config=tf.ConfigProto(gpu_options=gpu_options))

The same works with tf.Session() instead of tf.InteractiveSession(), if you prefer.

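Second, regarding the sizes:
Since there is no information about the size of your network, we cannot estimate what is going wrong. However, you can debug the network step by step. For example, create a network with only one layer, fetch its output, create a session, feed values once, and look at how much memory it consumes. Repeat this debugging session, adding one layer at a time, until you reach the point where you run out of memory.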

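Note that a 3840 x 155229 output is really large. It means about 600M neurons, which at 4 bytes per float32 value is about 2.22 GB for that one layer alone. If you have other layers of similar size, they add up and fill your GPU memory very quickly.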

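Also, this is only for the forward pass. If you train with this layer, backpropagation and the ops added by the optimizer will multiply this size by 2. So for training you consume about 5 GB for the output layer alone.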
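The arithmetic behind these figures can be written out as a rough lower bound. In the sketch below, the factor of 2 for training follows the rule of thumb above. The count of two decoders is read off the variable names in the question (previous_decoder and next_decoder) and is an assumption about the graph, not something measured. Softmax intermediates and any separate copies for the MatMul and Add outputs are ignored.

```python
BYTES_PER_FLOAT32 = 4
GIB = 2**30

# Figures taken from the question.
time_steps, batch_size, vocab_size = 30, 128, 155229
rows = time_steps * batch_size  # 3840

# One [3840, 155229] float32 tensor, as named in the OOM message.
logits_gib = rows * vocab_size * BYTES_PER_FLOAT32 / GIB
print(round(logits_gib, 2))  # 2.22

# Rule of thumb from the answer: training roughly doubles the forward cost.
train_gib_per_decoder = 2 * logits_gib
print(round(train_gib_per_decoder, 2))  # 4.44

# Assumption: both decoders project onto the full vocabulary.
num_decoders = 2
train_gib_total = num_decoders * train_gib_per_decoder
print(round(train_gib_total, 2))  # 8.88

gpu_total_gib = 10.91  # "Total memory" from the device log in the question
print(round(train_gib_total / gpu_total_gib, 2))  # 0.81
```

Under these assumptions the two output projections alone already claim most of the card before the embeddings, GRU activations, and optimizer state are counted.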

I suggest you revise your network and try reducing the batch size or the number of parameters so that your model fits on the GPU.
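As a rough illustration of how batch size drives the size of the output tensors, the sketch below scans a few candidate batch sizes against a memory budget. The helper names, the factor of 4 copies (two decoders, forward plus gradient), and the 5 GiB budget are all hypothetical choices for the example. Real usage also depends on everything else in the graph.

```python
def logits_gib(batch_size, time_steps=30, vocab_size=155229, bytes_per_value=4):
    """Size in GiB of one [time_steps * batch_size, vocab_size] tensor."""
    return batch_size * time_steps * vocab_size * bytes_per_value / 2**30


def largest_batch(budget_gib, copies=4, candidates=(128, 64, 32, 16, 8)):
    """Return the first candidate whose `copies` logits tensors fit the budget.

    `copies=4` is an assumption (two decoders, forward plus gradient).
    Returns None if no candidate fits.
    """
    for batch_size in candidates:
        if copies * logits_gib(batch_size) <= budget_gib:
            return batch_size
    return None


for batch_size in (128, 64, 32, 16):
    print(batch_size, round(4 * logits_gib(batch_size), 2))

print(largest_batch(budget_gib=5.0))  # 64
```

The cost scales linearly with the batch size, so halving the batch halves the output-layer footprint. Shrinking the vocabulary has the same linear effect.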

Regarding tensorflow - Understanding ResourceExhaustedError: OOM when allocating tensor with shape, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46066850/
