python - Using tf.train.Saver() on the GPU crashes TensorFlow

Reposted by: 太空宇宙 | Updated: 2023-11-03 16:23:50

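I don't even know how to approach this, or what to search for, but when I run some code on the GPU I get the following traceback when using a tf.train.Saver object to track variable state. When I comment out the Saver instantiation or switch to CPU:0, the code runs fine.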

  File "entrypoint.py", line 496, in <module>
    online_mvrcca_multipie_test3()
  File "entrypoint.py", line 490, in online_mvrcca_multipie_test3
    gs_res = gridsearch_optimizer_cb(parameter_ranges,exp_f_handle);
  File "/homes/sj16/LPLUSS/deps/sjpy_utils/exptools/parameter_search.py", line 48, in gridsearch_optimizer_async
    f_handle(parameter_instance);
  File "entrypoint.py", line 487, in <lambda>
    {}\
  File "/homes/sj16/LPLUSS/deps/pyena/src/sessions.py", line 115, in submit_to_local_session
    run_metadata_ptr)
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 636, in _run
    feed_dict_string, options, run_metadata)
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 708, in _do_run
    target_list, options, run_metadata)
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 728, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/Const': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Identity: CPU
Const: CPU
         [[Node: save/Const = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: model>, _device="/device:GPU:0"]()]]
Caused by op u'save/Const', defined at:
  File "entrypoint.py", line 496, in <module>
    online_mvrcca_multipie_test3()
  File "entrypoint.py", line 490, in online_mvrcca_multipie_test3
    gs_res = gridsearch_optimizer_cb(parameter_ranges,exp_f_handle);
  File "/homes/sj16/LPLUSS/deps/sjpy_utils/exptools/parameter_search.py", line 48, in gridsearch_optimizer_async
    f_handle(parameter_instance);
  File "entrypoint.py", line 487, in <lambda>
    {}\
  File "/homes/sj16/LPLUSS/deps/pyena/src/sessions.py", line 115, in submit_to_local_session
    worker_result=worker_task(*worker_args);
  File "/homes/sj16/LPLUSS/src/experiments/matrix_reconstruction/online/mvrcca_online/image_exp/experiment_workers.py", line 41, in batch_mv_recon_test_mc7
    saver = tf.train.Saver()   #Here is the offending call to Saver(), having set up the graph
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 845, in __init__
    restore_sequentially=restore_sequentially)
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 504, in build
    filename_tensor = constant_op.constant("model")
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/ops/constant_op.py", line 166, in constant
    attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2260, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1230, in __init__
    self._traceback = _extract_stack()

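What I take from this is that TF apparently cannot save a tf.constant to a checkpoint file when you are in GPU mode? Because there is no GPU "kernel" implementation (not sure what that means in this context) to execute the save/Const node (saving a constant?).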

That seems a bit odd... not being able to save and restore a named constant...

Also, I never use tf.constant(), but I'm guessing a Const node is created when you call tf.convert_to_tensor with a numeric/NumPy variable?
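
On the guess above: in graph mode, tf.convert_to_tensor on a Python number or a NumPy array does create a Const op. The node named in the error is a different one, though. save/Const has dtype=DT_STRING and value "model", and the traceback shows it being created inside Saver's build step as the checkpoint filename tensor, not from any user data. The sketch below only inspects op types. It assumes a TensorFlow install that ships the tensorflow.compat.v1 module (1.14 or later), which the 0.9.0rc0 environment in the question did not have; there the same calls sit directly under tensorflow.

```python
import numpy as np
import tensorflow.compat.v1 as tf  # assumption: TF >= 1.14; on 0.9 use `import tensorflow as tf`

graph = tf.Graph()
with graph.as_default():
    # Converting plain Python / NumPy values adds constant nodes to the graph.
    from_number = tf.convert_to_tensor(3.0)
    from_array = tf.convert_to_tensor(np.ones((2, 2), dtype=np.float32))
    # A string constant like the one Saver creates for the checkpoint filename.
    from_string = tf.constant("model")

# Collect the type of the op that produced each tensor.
op_types = [t.op.type for t in (from_number, from_array, from_string)]
print(op_types)
print(from_string.dtype)
```

A string constant pinned to a GPU is what the error message is complaining about: the colocation info lists only CPU for Const, while the node carries _device="/device:GPU:0".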

------------ EDIT to show a minimal example -----

Environment:

CUDA 7.5.18 with a Tesla K40c; Ubuntu 14.04; GPU TensorFlow 0.9.0rc0, in a Python 2.7 miniconda environment

import os,math
import operator as op
import tensorflow as tf


with tf.device('/gpu:0'):
    tf_session=tf.Session()

    exp_model_dir= os.path.join(os.path.expanduser("~"),'tf_scratchpad/saver_failure_dense_only')

    if not os.path.isdir(exp_model_dir):
        os.mkdir(exp_model_dir)

    ranklim=10
    dense_widths=[64,ranklim,64, 128]

    # input to the network

    input_data = tf.placeholder(tf.float32, [1,128], name='input_data')

    current_input = input_data

    for layer_i, n_output in enumerate(dense_widths[0:]):

        n_input = int(current_input.get_shape()[1])
        W = tf.Variable(
            tf.random_uniform([n_input, n_output],
                              -1.0 / math.sqrt(n_input),
                              1.0 / math.sqrt(n_input)))
        b = tf.Variable(tf.zeros([n_output]))

        output = tf.nn.relu(tf.matmul(current_input, W) + b)
        current_input = output

    # reconstruction through the network
    y = current_input
    cost = tf.reduce_sum(tf.square(y - input_data))

    train_writer = tf.train.SummaryWriter(os.path.join(exp_model_dir,'train'),
                                          tf_session.graph)


    optimizer = tf.train.GradientDescentOptimizer(0.0075).minimize(cost)

    saver = tf.train.Saver()

    tf_session.run(tf.initialize_all_variables())

This produces:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:05:00.0
Total memory: 11.25GiB
Free memory: 11.15GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x2a95d80
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 1 with properties:
name: Quadro K600
major: 3 minor: 0 memoryClockRate (GHz) 0.8755
pciBusID 0000:04:00.0
Total memory: 1023.31MiB
Free memory: 425.00MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:59] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_init.cc:59] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y N
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 1:   N Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:793] Ignoring gpu device (device: 1, name: Quadro K600, pci bus id: 0000:04:00.0) with Cuda multiprocessor count: 1. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
Traceback (most recent call last):
  File "tfcrash.py", line 48, in <module>
    tf_session.run(tf.initialize_all_variables())
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 372, in run
    run_metadata_ptr)
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 636, in _run
    feed_dict_string, options, run_metadata)
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 708, in _do_run
    target_list, options, run_metadata)
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 728, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/Const': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Identity: CPU
Const: CPU
   [[Node: save/Const = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: model>, _device="/device:GPU:0"]()]]
Caused by op u'save/Const', defined at:
  File "tfcrash.py", line 46, in <module>
    saver = tf.train.Saver()
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 845, in __init__
    restore_sequentially=restore_sequentially)
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 504, in build
    filename_tensor = constant_op.constant("model")
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/ops/constant_op.py", line 166, in constant
    attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2260, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/homes/sj16/miniconda/envs/tensorflow27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1230, in __init__
    self._traceback = _extract_stack()

The error is actually raised at initialize_all_variables(), but it is blamed on the call to tf.train.Saver(). Commenting out the Saver() call or using '/cpu:0' prevents the exception.

Best answer

Basically, tf.train.Saver() should not be created under with tf.device('/gpu:0').

Every op in TensorFlow has a device assignment, and the saver ops should always be assigned to the CPU.
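
A minimal sketch of that advice, not the asker's full network: only the numeric ops are built inside with tf.device('/gpu:0'), and the Saver is created after the block exits, so its ops carry no explicit GPU request. Assumptions, none of which come from the original post: the tensorflow.compat.v1 module (TF 1.14 or later; on 0.9 use `import tensorflow as tf` and tf.initialize_all_variables()), the single 128x64 layer, the variable names, and the temporary checkpoint directory. allow_soft_placement=True is an existing ConfigProto option, used here only so the sketch also runs on a machine without a GPU; it is not part of the answer.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF >= 1.14; on 0.9 use `import tensorflow as tf`

graph = tf.Graph()
with graph.as_default():
    # Pin only the numeric part of the model to the GPU.
    with tf.device('/gpu:0'):
        input_data = tf.placeholder(tf.float32, [1, 128], name='input_data')
        W = tf.Variable(tf.random_uniform([128, 64], -0.1, 0.1), name='W')
        b = tf.Variable(tf.zeros([64]), name='b')
        output = tf.nn.relu(tf.matmul(input_data, W) + b)

    # Outside the GPU block: the Saver's ops get no explicit GPU assignment.
    saver = tf.train.Saver()
    reset_W = W.assign(tf.zeros([128, 64]))  # used below to check that restore works
    init_op = tf.global_variables_initializer()

# Soft placement lets GPU-pinned ops fall back to the CPU if no GPU is present.
config = tf.ConfigProto(allow_soft_placement=True)
ckpt_dir = tempfile.mkdtemp()  # illustrative scratch location

with tf.Session(graph=graph, config=config) as sess:
    sess.run(init_op)
    w_saved = sess.run(W)
    ckpt_path = saver.save(sess, os.path.join(ckpt_dir, 'model'))

    sess.run(reset_W)                 # overwrite W with zeros
    w_zeroed = sess.run(W)
    saver.restore(sess, ckpt_path)    # bring back the checkpointed values
    w_restored = sess.run(W)

print((w_saved == w_restored).all())
```

Wrapping the Saver construction in with tf.device('/cpu:0') is another way to express the same intent when the surrounding code cannot easily be restructured.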

Source: this question (python - Using tf.train.Saver() on the GPU crashes TensorFlow) corresponds to a similar question on Stack Overflow: https://stackoverflow.com/questions/38173580/
