
python - Distributed tensorflow: ValueError "When using replicas, all Variables must have their device set: name: "Variable""


I am trying to write a distributed variational autoencoder on TensorFlow in standalone mode.

My cluster consists of 3 machines, named m1, m2 and m3. I am trying to run 1 ps server on m1 and 2 worker servers on m2 and m3 (following the example trainer program in the distributed tensorflow documentation). On m3, I got the following error message:

Traceback (most recent call last):
  File "/home/yama/mfs/ZhuSuan/examples/vae.py", line 241, in <module>
    save_model_secs=600)
  File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 334, in __init__
    self._verify_setup()
  File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 863, in _verify_setup
    "their device set: %s" % op)
ValueError: When using replicas, all Variables must have their device set: name: "Variable"
op: "Variable"
attr {
  key: "container"
  value {
    s: ""
  }
}
attr {
  key: "dtype"
  value {
    type: DT_INT32
  }
}
attr {
  key: "shape"
  value {
    shape {
    }
  }
}
attr {
  key: "shared_name"
  value {
    s: ""
  }
}
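For context, a setup like this is normally wired together with a tf.train.ClusterSpec and one tf.train.Server per machine, as in the documentation's trainer template. The following is only a minimal sketch; the hostnames m1/m2/m3 and port 2222 are assumptions, not taken from the post:

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of the task within its job")
FLAGS = flags.FLAGS

# One ps task on m1, two worker tasks on m2 and m3 (ports are assumed).
clusterSpec = tf.train.ClusterSpec({
    "ps": ["m1:2222"],
    "worker": ["m2:2222", "m3:2222"],
})

# Each machine starts a server for its own task.
server = tf.train.Server(clusterSpec,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)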

Here is the part of my code that defines the network and the supervisor:

if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":

    # set distributed device
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=clusterSpec)):

        # Build the training computation graph
        x = tf.placeholder(tf.float32, shape=(None, x_train.shape[1]))
        optimizer = tf.train.AdamOptimizer(learning_rate=0.001, epsilon=1e-4)
        with tf.variable_scope("model") as scope:
            with pt.defaults_scope(phase=pt.Phase.train):
                train_model = M1(n_z, x_train.shape[1])
                train_vz_mean, train_vz_logstd = q_net(x, n_z)
                train_variational = ReparameterizedNormal(
                    train_vz_mean, train_vz_logstd)
                grads, lower_bound = advi(
                    train_model, x, train_variational, lb_samples, optimizer)
                infer = optimizer.apply_gradients(grads)
                # print(type(lower_bound))

        # Build the evaluation computation graph
        with tf.variable_scope("model", reuse=True) as scope:
            with pt.defaults_scope(phase=pt.Phase.test):
                eval_model = M1(n_z, x_train.shape[1])
                eval_vz_mean, eval_vz_logstd = q_net(x, n_z)
                eval_variational = ReparameterizedNormal(
                    eval_vz_mean, eval_vz_logstd)
                eval_lower_bound = is_loglikelihood(
                    eval_model, x, eval_variational, lb_samples)
                eval_log_likelihood = is_loglikelihood(
                    eval_model, x, eval_variational, ll_samples)

    # saver = tf.train.Saver()
    summary_op = tf.merge_all_summaries()
    global_step = tf.Variable(0)
    init_op = tf.initialize_all_variables()

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             logdir=LogDir,
                             init_op=init_op,
                             summary_op=summary_op,
                             # saver=saver,
                             global_step=global_step,
                             save_model_secs=600)
    print("create sv done")

I think there must be something wrong with my code, but I can't figure out how to fix it. Any suggestions? Thanks a lot!

Best Answer

The problem stems from the definition of your global_step variable:

global_step = tf.Variable(0)

This definition falls outside the scope of the with tf.device(tf.train.replica_device_setter(...)): block above, so no device is assigned to global_step. In replicated training this is a common source of errors (because if the different replicas decide to place the variable on different devices, they won't share the same value), so TensorFlow includes a sanity check that prevents this.
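You can see what the check is complaining about by printing a variable's device attribute. This is a minimal sketch (assuming the clusterSpec from the question): a variable created outside any device scope reports an empty device string, while one created under replica_device_setter is pinned to the ps job.

unplaced = tf.Variable(0)
print(unplaced.device)  # "" -- no device assigned, which trips the check

with tf.device(tf.train.replica_device_setter(cluster=clusterSpec)):
    placed = tf.Variable(0)
print(placed.device)    # "/job:ps/task:0"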

Fortunately, the solution is simple. Either define global_step inside the with tf.device(tf.train.replica_device_setter(...)): block above, or add a small with tf.device("/job:ps/task:0"): block as follows:

with tf.device("/job:ps/task:0"):
    global_step = tf.Variable(0, name="global_step")
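Equivalently, the first option is to keep the definition inside the existing replica_device_setter block, which pins the variable to the ps job automatically. A sketch (the trainable=False argument is conventional for step counters but was not in the original):

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=clusterSpec)):
    # ... model definition as before ...
    global_step = tf.Variable(0, name="global_step", trainable=False)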

Regarding python - Distributed tensorflow: ValueError "When using replicas, all Variables must have their device set: name: "Variable"", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38793718/
