I'm running into a GPU-related error while working with the latest TensorFlow (2.13). Note that the test model training provided on the tensorflow-metal page, which I used to verify my setup, works fine.
Please advise.
Below is the command I used. The script is from [github.com/tensorflow/models][1]:
```
python3 model_main_tf2.py --model_dir=models/ark_mask_rcnn_inception_resnet_v2 --pipeline_config_path=models/ark_mask_rcnn_inception_resnet_v2/pipeline.config
```

```
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__IteratorGetNext_output_types_18_device_/job:localhost/replica:0/task:0/device:GPU:0}} indices[0] = 0 is not in [0, 0)
	 [[{{node GatherV2_7}}]]
	 [[MultiDeviceIteratorGetNextFromShard]]
	 [[RemoteCall]] [Op:IteratorGetNext] name:
```
The above are the last lines of the error message. Below is the full log from the model training script:
2023-09-10 20:06:55.580212: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 32.00 GB
2023-09-10 20:06:55.580217: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 10.67 GB
2023-09-10 20:06:55.580248: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-10 20:06:55.580265: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2023-09-10 20:06:55.581703: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-10 20:06:55.581712: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0910 20:06:55.581999 8568659456 mirrored_strategy.py:419] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I0910 20:06:55.590664 8568659456 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0910 20:06:55.590721 8568659456 config_util.py:552] Maybe overwriting use_bfloat16: False
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W0910 20:06:55.605112 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['annotations/train.record']
I0910 20:06:55.607398 8568659456 dataset_builder.py:162] Reading unweighted datasets: ['annotations/train.record']
INFO:tensorflow:Reading record datasets for input file: ['annotations/train.record']
I0910 20:06:55.607451 8568659456 dataset_builder.py:79] Reading record datasets for input file: ['annotations/train.record']
INFO:tensorflow:Number of filenames to read: 1
I0910 20:06:55.607482 8568659456 dataset_builder.py:80] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W0910 20:06:55.607504 8568659456 dataset_builder.py:86] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.deterministic`.
W0910 20:06:55.610141 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.deterministic`.
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
W0910 20:06:55.618376 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py:459: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
W0910 20:06:56.389322 8568659456 deprecation.py:569] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py:459: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W0910 20:06:58.673335 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0910 20:06:59.748894 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
2023-09-10 20:07:01.205124: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2023-09-10 20:07:01.207747: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
Traceback (most recent call last):
File "/Users/_dga/ml-git/tf-ark/Tensorflow/workspace/training_demo/model_main_tf2.py", line 126, in <module>
tf.compat.v1.app.run()
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/platform/app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/Users/_dga/ml-git/tf-ark/Tensorflow/workspace/training_demo/model_main_tf2.py", line 117, in main
model_lib_v2.train_loop(
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 605, in train_loop
load_fine_tune_checkpoint(
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 401, in load_fine_tune_checkpoint
_ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 161, in _ensure_model_is_built
features, labels = iter(input_dataset).next()
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 260, in next
return self.__next__()
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 264, in __next__
return self.get_next()
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 325, in get_next
return self._get_next_no_partial_batch_handling(name)
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 361, in _get_next_no_partial_batch_handling
replicas.extend(self._iterators[i].get_next_as_list(new_name))
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 1427, in get_next_as_list
return self._format_data_list_with_options(self._iterator.get_next())
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py", line 553, in get_next
result.append(self._device_iterators[i].get_next())
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 867, in get_next
return self._next_internal()
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 777, in _next_internal
ret = gen_dataset_ops.iterator_get_next(
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 3028, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 6656, in raise_from_not_ok_status
raise core._status_to_exception(e) from None # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__IteratorGetNext_output_types_18_device_/job:localhost/replica:0/task:0/device:GPU:0}} indices[0] = 0 is not in [0, 0)
[[{{node GatherV2_7}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]] [Op:IteratorGetNext] name:
[1]: https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py
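For readers unfamiliar with the final error line: `indices[0] = 0 is not in [0, 0)` is a bounds complaint from a gather operation (`GatherV2_7` in the log), and the half-open range `[0, 0)` indicates that the tensor being indexed had zero elements along the gathered axis. The sketch below mimics that check in plain Python. Note that `gather` and `try_gather` are hypothetical helpers written for illustration, not TensorFlow's implementation, and the sketch says nothing about which tensor in the input pipeline was actually empty.

```python
def gather(params, indices):
    """Hypothetical stand-in for a gather op with a TensorFlow-like bounds check."""
    limit = len(params)  # valid indices form the half-open range [0, limit)
    out = []
    for pos, idx in enumerate(indices):
        if not 0 <= idx < limit:
            # Message format modelled on the error line in the log above.
            raise IndexError(f"indices[{pos}] = {idx} is not in [0, {limit})")
        out.append(params[idx])
    return out


def try_gather(params, indices):
    """Return the gathered values, or the error text if an index is out of range."""
    try:
        return gather(params, indices)
    except IndexError as err:
        return str(err)


# Non-empty input: every index falls inside [0, 2).
print(try_gather(["a", "b"], [1, 0]))

# Empty input: even index 0 is out of range, because [0, 0) contains nothing.
print(try_gather([], [0]))
```

Read this way, the message suggests that some per-example field arrived empty in the input pipeline, before the model itself ran.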
Running the setup verification script available on the Apple tensorflow-metal page, i.e.
```
import tensorflow as tf

# Load the CIFAR-100 dataset bundled with Keras.
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()

# Build an untrained ResNet50 sized for 32x32 RGB images and 100 classes.
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,
)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)
```
works fine, i.e. it detects the device, etc.
This answer / assumption also seems to be incorrect. Training the same model on an Ubuntu machine with GPU / CPU also fails with an identical error.
I found this problem listed in a GitHub issue dating back to 2020.
For future reference, for myself and others:
On the same machine, I could successfully train other categories of models, but I couldn't find any specific answer as to why this error shows up for this particular model type, i.e. mask_rcnn_inception_resnet.
Thus I concluded that, since this model is not yet supported on TPUs, it cannot run on a Mac M2: although the device is called a GPU, TF possibly sees it as a TPU because of the pluggable device pattern used by tensorflow-metal.
Further update: I managed to get hold of someone from the official TensorFlow team, and the word is that research models (i.e. the tensorflow/models/research section) are not supported; we are expected to use the official models.
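On the observation above that only the Mask R-CNN config fails while other model types train: one thing that sets Mask R-CNN pipelines apart is that the input reader in `pipeline.config` is normally asked to load instance masks, through the `load_instance_masks` and `mask_type` fields. Whether this is related to the error here is not established anywhere in this thread, so treat the following purely as a diagnostic sketch. It scans a config string for those two fields. `mask_settings` is a hypothetical helper, and the sample text, including the `label_map_path` value, is made up for illustration.

```python
import re


def mask_settings(config_text):
    """Return (load_instance_masks, mask_type) from a pipeline.config string.

    Only the first occurrence of each field is reported, so a config with
    separate train and eval readers would need a closer look by hand.
    """
    load = re.search(r"load_instance_masks\s*:\s*(true|false)", config_text)
    mtype = re.search(r"mask_type\s*:\s*(\w+)", config_text)
    return (
        load.group(1) == "true" if load else False,
        mtype.group(1) if mtype else None,
    )


# Made-up config fragment; only the train.record path comes from the log above.
sample = """
train_input_reader: {
  label_map_path: "annotations/label_map.pbtxt"
  load_instance_masks: true
  mask_type: PNG_MASKS
  tf_record_input_reader { input_path: "annotations/train.record" }
}
"""

print(mask_settings(sample))
```

If the config requests masks, a natural next step would be to confirm that the records in `annotations/train.record` actually contain mask data in the declared format.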
Working Mac M1 gist for TF2 Object detection