
python - How does distributed TensorFlow work? (issue with tf.train.Server)

Reposted · Author: 行者123 · Updated: 2023-12-03 16:28:30

I'm having some trouble with TensorFlow's new feature that lets us run TensorFlow in a distributed fashion.

I just want to run two tf.constant ops on two tasks, but my code never terminates. It looks like this:

import tensorflow as tf

cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster,
                         job_name="local",
                         task_index=0)

with tf.Session(server.target) as sess:
    with tf.device("/job:local/replica:0/task:0"):
        const1 = tf.constant("Hello I am the first constant")
    with tf.device("/job:local/replica:0/task:1"):
        const2 = tf.constant("Hello I am the second constant")
    print(sess.run([const1, const2]))
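(For reference, the hang is expected here: the script only starts the server for task 0, so nothing is ever listening on localhost:2223, and the session blocks trying to place const2 on task 1. A minimal single-process sketch that starts both servers does terminate; it is written against `tf.compat.v1`, since `tf.train.Server` lives there in TensorFlow 2.x.)

```python
# Sketch: start BOTH task servers in one process (assumption: tf.compat.v1
# on TensorFlow 2.x; on TF 1.x, plain tf.train.Server works the same way).
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
# One Server object per task; each starts a gRPC service in this process.
server0 = tf.train.Server(cluster, job_name="local", task_index=0)
server1 = tf.train.Server(cluster, job_name="local", task_index=1)

with tf.Session(server0.target) as sess:
    with tf.device("/job:local/task:0"):
        const1 = tf.constant("Hello I am the first constant")
    with tf.device("/job:local/task:1"):
        const2 = tf.constant("Hello I am the second constant")
    out = sess.run([const1, const2])  # no longer blocks: both tasks are up
    print(out)
```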

And I have the following code that does work (but only runs on a single localhost:2222):
import tensorflow as tf

cluster = tf.train.ClusterSpec({"local": ["localhost:2222"]})
server = tf.train.Server(cluster,
                         job_name="local",
                         task_index=0)

with tf.Session(server.target) as sess:
    with tf.device("/job:local/replica:0/task:0"):
        const1 = tf.constant("Hello I am the first constant")
        const2 = tf.constant("Hello I am the second constant")
    print(sess.run([const1, const2]))

out : ['Hello I am the first constant', 'Hello I am the second constant']

Maybe I'm misunderstanding how these features work... so if you have an idea, please let me know.

Thanks ;)

EDIT

OK, I found out that I can't run it the way I did from an IPython notebook. I need a standalone Python program executed from the terminal.
But now I hit a new problem when running the code: the server tries to connect to both of the given ports, even though I told it to run on only one of them.
My new code looks like this:
import tensorflow as tf

tf.app.flags.DEFINE_string('job_name', '', 'One of local worker')
tf.app.flags.DEFINE_string('local', '', """Comma-separated list of hostname:port for the """)

tf.app.flags.DEFINE_integer('task_id', 0, 'Task ID of local/replica running the training')
tf.app.flags.DEFINE_integer('constant_id', 0, 'the constant we want to run')

FLAGS = tf.app.flags.FLAGS

local_host = FLAGS.local.split(',')

cluster = tf.train.ClusterSpec({"local": local_host})
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_id)

with tf.Session(server.target) as sess:
    if FLAGS.constant_id == 0:
        with tf.device('/job:local/task:' + str(FLAGS.task_id)):
            const1 = tf.constant("Hello I am the first constant")
        print(sess.run(const1))
    if FLAGS.constant_id == 1:
        with tf.device('/job:local/task:' + str(FLAGS.task_id)):
            const2 = tf.constant("Hello I am the second constant")
        print(sess.run(const2))

I run the following command line:
python test_distributed_tensorflow.py --local=localhost:3000,localhost:3001 --job_name=local --task_id=0 --constant_id=0

and I get the following log:
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0)
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job local -> {localhost:3000, localhost:3001}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:3000
E0518 15:27:11.794873779 10884 tcp_client_posix.c:173] failed to connect to 'ipv4:127.0.0.1:3001': socket error: connection refused
E0518 15:27:12.795184395 10884 tcp_client_posix.c:173] failed to connect to 'ipv4:127.0.0.1:3001': socket error: connection refused
...

编辑2

I found the solution: I just have to start every task that I declared to the server in the cluster spec. So I have to run this:
python test_distributed_tensorflow.py --local=localhost:2345,localhost:2346 --job_name=local --task_id=0 --constant_id=0 \
& \
python test_distributed_tensorflow.py --local=localhost:2345,localhost:2346 --job_name=local --task_id=1 --constant_id=1

I hope this helps someone ;)

Best Answer

Recent versions of TensorFlow provide distribution strategies (tf.distribute.Strategy) for working across multiple systems.
Distribution strategies are explained with examples; take a look at this link.
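As a rough sketch of what that newer API looks like (a minimal single-machine example; `MirroredStrategy` is the local multi-device strategy, while `tf.distribute.MultiWorkerMirroredStrategy` is the one that spans multiple systems):

```python
# Sketch: replicated computation with tf.distribute (TensorFlow 2.x).
# MirroredStrategy replicates across local devices; swap in
# tf.distribute.MultiWorkerMirroredStrategy for multiple machines.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    v = tf.Variable(1.0)  # mirrored onto every replica

@tf.function
def step():
    def replica_fn():
        return v + 1.0  # runs once on each replica
    return strategy.run(replica_fn)

result = step()
# On one device this is a plain tensor; with several replicas it is a PerReplica value.
print(strategy.experimental_local_results(result))
```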

Regarding python - How does distributed TensorFlow work? (issue with tf.train.Server), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37294201/
