作者热门文章
- html - 出于某种原因,IE8 对我的 Sass 文件中继承的 html5 CSS 不友好?
- JMeter 在响应断言中使用 span 标签的问题
- html - 在 :hover and :active? 上具有不同效果的 CSS 动画
- html - 相对于居中的 html 内容固定的 CSS 重复背景?
我按照tensorflow官方网站的例子。
https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
这是我的规范
世界 super 联赛
Ubuntu 16.04.6 长期支持版
TensorFlow2.0
没有可用的 GPU
我有一个名为“tfexample.py”的文件,看起来像这样
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow_datasets as tfds
import tensorflow as tf
import json, os
tfds.disable_progress_bar()
os.environ["TF_CONFIG"] = json.dumps(
{
"cluster": {"worker": ["localhost:12345", "localhost:23456"]},
"task": {"type": "worker", "index": 0},
}
)
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
BUFFER_SIZE = 10000
BATCH_SIZE = 64
def make_datasets_unbatched():
# Scaling MNIST data from (0, 255] to (0., 1.]
def scale(image, label):
image = tf.cast(image, tf.float32)
image /= 255
return image, label
datasets, info = tfds.load(name="mnist", with_info=True, as_supervised=True)
return datasets["train"].map(scale).cache().shuffle(BUFFER_SIZE)
train_datasets = make_datasets_unbatched().batch(BATCH_SIZE)
def build_and_compile_cnn_model():
model = tf.keras.Sequential(
[
tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(64, activation="relu"),
tf.keras.layers.Dense(10, activation="softmax"),
]
)
model.compile(
loss=tf.keras.losses.sparse_categorical_crossentropy,
optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
metrics=["accuracy"],
)
return model
# single_worker_model = build_and_compile_cnn_model()
# single_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)
NUM_WORKERS = 2
# Here the batch size scales up by number of workers since
# `tf.data.Dataset.batch` expects the global batch size. Previously we used 64,
# and now this becomes 128.
GLOBAL_BATCH_SIZE = 64 * NUM_WORKERS
with strategy.scope():
# Creation of dataset, and model building/compiling need to be within
# `strategy.scope()`.
train_datasets = make_datasets_unbatched().batch(GLOBAL_BATCH_SIZE)
multi_worker_model = build_and_compile_cnn_model()
# Keras' `model.fit()` trains the model with specified number of epochs and
# number of steps per epoch. Note that the numbers here are for demonstration
# purposes only and may not sufficiently produce a model with good quality.
multi_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)
当我用
运行这个文件时python tfexample.py
终端就这样挂了
2020-02-04 17:50:23.483411: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-02-04 17:50:23.485194: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-02-04 17:50:23.485747: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/home/danny/.local/lib/python2.7/site-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
warnings.warn(warning, RequestsDependencyWarning)
2020-02-04 17:50:29.013263: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-02-04 17:50:29.014152: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-02-04 17:50:29.014781: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (WINDOWS-6DFFM0Q): /proc/driver/nvidia/version does not exist
2020-02-04 17:50:29.015780: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-04 17:50:29.025575: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2701000000 Hz
2020-02-04 17:50:29.027050: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x66b11a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-04 17:50:29.027669: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
E0204 17:50:29.038614800 24084 socket_utils_common_posix.cc:198] check for SO_REUSEPORT: {"created":"@1580856629.038575000","description":"Protocol not available","errno":92,"file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":175,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
E0204 17:50:29.039313500 24084 socket_utils_common_posix.cc:299] setsockopt(TCP_USER_TIMEOUT) Protocol not available
2020-02-04 17:50:29.051180: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12345, 1 -> localhost:23456}
2020-02-04 17:50:29.053392: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:12345
任何帮助将不胜感激!
最佳答案
您是否使用正确的 TFconfig 在两个 session 中运行 tfexample.py。我没有在同一台机器上尝试过两个实例
关于Tensorflow2.0 MultiWorkerMirroredStrategy 示例挂起,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/60066814/
我按照tensorflow官方网站的例子。 https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras 这是我的规范
我在使用 MultiWorkerMirroredStrategy() 在 Google AI 平台 (CMLE) 上训练自定义估算器时遇到以下错误。 ValueError: Unrecognized
我是一名优秀的程序员,十分优秀!