
python - Distributed tensorflow: ValueError "When using replicas, all Variables must have their device set: name: "Variable""


I am trying to write a distributed variational autoencoder on TensorFlow in standalone mode.

My cluster consists of 3 machines, named m1, m2 and m3. I am trying to run 1 ps server on m1 and 2 worker servers on m2 and m3 (following the example trainer program in the distributed tensorflow documentation). On m3, I got the following error message:

Traceback (most recent call last):
  File "/home/yama/mfs/ZhuSuan/examples/vae.py", line 241, in <module>
    save_model_secs=600)
  File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 334, in __init__
    self._verify_setup()
  File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 863, in _verify_setup
    "their device set: %s" % op)
ValueError: When using replicas, all Variables must have their device set: name: "Variable"
op: "Variable"
attr {
  key: "container"
  value {
    s: ""
  }
}
attr {
  key: "dtype"
  value {
    type: DT_INT32
  }
}
attr {
  key: "shape"
  value {
    shape {
    }
  }
}
attr {
  key: "shared_name"
  value {
    s: ""
  }
}
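For context, a setup like this is normally wired together with a tf.train.ClusterSpec and one tf.train.Server per machine, as in the documentation's trainer template. The following is only a minimal sketch; the hostnames m1/m2/m3 and port 2222 are assumptions, not taken from the post:

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of the task within its job")
FLAGS = flags.FLAGS

# One ps task on m1, two worker tasks on m2 and m3 (ports are assumed).
clusterSpec = tf.train.ClusterSpec({
    "ps": ["m1:2222"],
    "worker": ["m2:2222", "m3:2222"],
})

# Each machine starts a server for its own task.
server = tf.train.Server(clusterSpec,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)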

Here is the part of my code that defines the network and the supervisor:

if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":

    # set distributed device
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=clusterSpec)):

        # Build the training computation graph
        x = tf.placeholder(tf.float32, shape=(None, x_train.shape[1]))
        optimizer = tf.train.AdamOptimizer(learning_rate=0.001, epsilon=1e-4)
        with tf.variable_scope("model") as scope:
            with pt.defaults_scope(phase=pt.Phase.train):
                train_model = M1(n_z, x_train.shape[1])
                train_vz_mean, train_vz_logstd = q_net(x, n_z)
                train_variational = ReparameterizedNormal(
                    train_vz_mean, train_vz_logstd)
                grads, lower_bound = advi(
                    train_model, x, train_variational, lb_samples, optimizer)
                infer = optimizer.apply_gradients(grads)
                # print(type(lower_bound))

        # Build the evaluation computation graph
        with tf.variable_scope("model", reuse=True) as scope:
            with pt.defaults_scope(phase=pt.Phase.test):
                eval_model = M1(n_z, x_train.shape[1])
                eval_vz_mean, eval_vz_logstd = q_net(x, n_z)
                eval_variational = ReparameterizedNormal(
                    eval_vz_mean, eval_vz_logstd)
                eval_lower_bound = is_loglikelihood(
                    eval_model, x, eval_variational, lb_samples)
                eval_log_likelihood = is_loglikelihood(
                    eval_model, x, eval_variational, ll_samples)

    # saver = tf.train.Saver()
    summary_op = tf.merge_all_summaries()
    global_step = tf.Variable(0)
    init_op = tf.initialize_all_variables()

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             logdir=LogDir,
                             init_op=init_op,
                             summary_op=summary_op,
                             # saver=saver,
                             global_step=global_step,
                             save_model_secs=600)
    print("create sv done")

I think there must be something wrong with my code, but I can't figure out how to fix it. Any suggestions? Thanks a lot!

Best Answer

The problem stems from the definition of your global_step variable:

global_step = tf.Variable(0)

This definition falls outside the scope of the with tf.device(tf.train.replica_device_setter(...)): block above, so no device is assigned to global_step. In replicated training this is a common source of errors (because if the different replicas decide to place the variable on different devices, they won't share the same value), so TensorFlow includes a sanity check that prevents this.
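You can see what the check is complaining about by printing a variable's device attribute. This is a minimal sketch (assuming the clusterSpec from the question): a variable created outside any device scope reports an empty device string, while one created under replica_device_setter is pinned to the ps job.

unplaced = tf.Variable(0)
print(unplaced.device)  # "" -- no device assigned, which trips the check

with tf.device(tf.train.replica_device_setter(cluster=clusterSpec)):
    placed = tf.Variable(0)
print(placed.device)    # "/job:ps/task:0"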

Fortunately, the solution is simple. Either define global_step inside the with tf.device(tf.train.replica_device_setter(...)): block above, or add a small with tf.device("/job:ps/task:0"): block as follows:

with tf.device("/job:ps/task:0"):
    global_step = tf.Variable(0, name="global_step")
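Equivalently, the first option is to keep the definition inside the existing replica_device_setter block, which pins the variable to the ps job automatically. A sketch (the trainable=False argument is conventional for step counters but was not in the original):

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=clusterSpec)):
    # ... model definition as before ...
    global_step = tf.Variable(0, name="global_step", trainable=False)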

Regarding python - Distributed tensorflow: ValueError "When using replicas, all Variables must have their device set: name: "Variable"", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38793718/
