
TensorFlow OOM after freezing the graph


I am running a seq2seq model with TensorFlow. When the inference program loads parameters from a checkpoint file with tf.train.Saver, it runs fine. But after exporting the graph with freeze_graph.py (which uses tf.framework.graph_util.convert_variables_to_constants()) and importing it into the inference program with tf.import_graph_def, I hit an OOM problem.
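For context, here is a rough sketch of the freeze-and-import workflow I am describing, assuming TF 1.x; the checkpoint/output paths and the output node name below are placeholders, not the actual names from my project:

import tensorflow as tf
from tensorflow.python.framework import graph_util

# --- Freezing (roughly what freeze_graph.py does internally) ---
# Paths and the output node name are placeholders for illustration.
with tf.Session() as sess:
    saver = tf.train.import_meta_graph("model.ckpt.meta")
    saver.restore(sess, "model.ckpt")
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=["decoder/log_softmax"])
    with tf.gfile.GFile("frozen_model.pb", "wb") as f:
        f.write(frozen.SerializeToString())

# --- Importing the frozen graph in the inference program ---
with tf.gfile.GFile("frozen_model.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
with tf.Graph().as_default():
    tf.import_graph_def(graph_def, name="")
    # The OOM below happens once a session starts creating kernels
    # for the (now constant) weight tensors on the GPU.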

Here is part of the error log:

W tensorflow/core/common_runtime/bfc_allocator.cc:274] ****************************************************************************************************
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 4.0KiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:983] Internal: Dst tensor is not initialized.
E tensorflow/core/common_runtime/executor.cc:594] Executor failed to create kernel. Internal: Dst tensor is not initialized.
[[Node: embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1024] values: -0.016628871 -0.2054652 -0.045054652...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Traceback (most recent call last):
File "inference.py", line 88, in console_main
result = list(inference(source_sentence))
File "inference.py", line 54, in inference
for sequence in result:
File "/data/experiment/decoder.py", line 115, in search_best_sequence
State.batch_predict(self.session, self.model, self.context, beam)
File "/data/experiment/decoder.py", line 82, in batch_predict
state_list[0].depth)
File "/data/experiment/seq2seq_model.py", line 452, in batch_feed_decoder
log_softmax, attns, state = session.run(output_fetch, input_feed)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 966, in _run
feed_dict_string, options, run_metadata)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1016, in _do_run
target_list, options, run_metadata)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1036, in _do_call
raise type(e)(node_def, op, message)
InternalError: Dst tensor is not initialized.
[[Node: embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1024] values: -0.016628871 -0.2054652 -0.045054652...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op u'embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0', defined at:
File "inference.py", line 169, in <module>
tf.app.run()
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "inference.py", line 165, in main
console_main(session)
File "inference.py", line 66, in console_main
model = create_model(session, False)
File "/data/experiment/model.py", line 145, in create_model
tensor_name_pickle=tensor_name_pickle)
File "/data/experiment/seq2seq_model.py", line 106, in __init__
tf.import_graph_def(graph_def, name="")
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 287, in import_graph_def
op_def=op_def)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
self._traceback = _extract_stack()

InternalError (see above for traceback): Dst tensor is not initialized.
[[Node: embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1024] values: -0.016628871 -0.2054652 -0.045054652...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

I suspect this is caused by a memory issue with tf.Constant. Has anyone run into this problem?

Best Answer

I ran into the same problem, except I was trying to load the model and run inference from a C++ application via the C API. After a lot of tweaking and testing, the culprit appears to be the frozen graph and freeze_graph.py itself. It is probably some kind of bug. There are actually several issue reports about this on the TF repo on GitHub, but they were closed due to lack of activity, e.g. here and here. I suppose obscure bugs in model freezing are not a priority for anyone.

In my case, the model's .pb file was about 500 MB, and it consumed about 10 GB of RAM while running a session. Not only did it take an outrageous amount of RAM, it was also orders of magnitude slower.

When I switched to loading a SavedModel directory instead, everything worked fine. I'm not sure how to achieve that in Python, but in my C code I replaced the TF_GraphImportGraphDef() call with TF_LoadSessionFromSavedModel().
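If you want to try the SavedModel route in Python (TF 1.x), I imagine it would look roughly like the sketch below; the export directory and tags are placeholders, and I haven't verified this myself since my code goes through the C API:

import tensorflow as tf

export_dir = "/tmp/seq2seq_saved_model"  # hypothetical path

# Export once, from the session that holds the trained variables:
# builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
# builder.add_meta_graph_and_variables(sess, [tf.saved_model.tag_constants.SERVING])
# builder.save()

# Load for inference; variables stay variables instead of being
# folded into Const nodes the way freeze_graph.py does:
with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(
        sess, [tf.saved_model.tag_constants.SERVING], export_dir)
    # look up input/output tensors by name and run inference as before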

I was using TF v1.14.0, with the library built from source with Bazel rather than the stock binaries. I can share some details here and there if anyone is interested; I'm just not sure where to start, as I went through a lot of trial and error.

Regarding TensorFlow OOM after freezing the graph, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42644658/
