Training a custom dataset model using mask_rcnn_inception from the TensorFlow Model Zoo on a MacBook Pro M2

Reposted by: bug小助手; last updated: 2023-10-25 14:08:49

I'm running into a GPU-related error while working with the latest TensorFlow (2.13). Note that the test model training provided on the tensorflow-metal page, which I used to verify my setup, works fine.

Please advise.

Below is the command I used; the script is from [github.com/tensorflow/models][1]:

 python3 model_main_tf2.py --model_dir=models/ark_mask_rcnn_inception_resnet_v2 --pipeline_config_path=models/ark_mask_rcnn_inception_resnet_v2/pipeline.config

tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__IteratorGetNext_output_types_18_device_/job:localhost/replica:0/task:0/device:GPU:0}} indices[0] = 0 is not in [0, 0)
     [[{{node GatherV2_7}}]]
     [[MultiDeviceIteratorGetNextFromShard]]
     [[RemoteCall]] [Op:IteratorGetNext] name:

The above are the last lines of the error message. Below is the full log from the model training script:

2023-09-10 20:06:55.580212: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 32.00 GB
2023-09-10 20:06:55.580217: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 10.67 GB
2023-09-10 20:06:55.580248: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-10 20:06:55.580265: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2023-09-10 20:06:55.581703: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-10 20:06:55.581712: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0910 20:06:55.581999 8568659456 mirrored_strategy.py:419] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I0910 20:06:55.590664 8568659456 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0910 20:06:55.590721 8568659456 config_util.py:552] Maybe overwriting use_bfloat16: False
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W0910 20:06:55.605112 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['annotations/train.record']
I0910 20:06:55.607398 8568659456 dataset_builder.py:162] Reading unweighted datasets: ['annotations/train.record']
INFO:tensorflow:Reading record datasets for input file: ['annotations/train.record']
I0910 20:06:55.607451 8568659456 dataset_builder.py:79] Reading record datasets for input file: ['annotations/train.record']
INFO:tensorflow:Number of filenames to read: 1
I0910 20:06:55.607482 8568659456 dataset_builder.py:80] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W0910 20:06:55.607504 8568659456 dataset_builder.py:86] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.deterministic`.
W0910 20:06:55.610141 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.deterministic`.
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
W0910 20:06:55.618376 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py:459: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
W0910 20:06:56.389322 8568659456 deprecation.py:569] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py:459: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W0910 20:06:58.673335 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0910 20:06:59.748894 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
2023-09-10 20:07:01.205124: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2023-09-10 20:07:01.207747: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
Traceback (most recent call last):
  File "/Users/_dga/ml-git/tf-ark/Tensorflow/workspace/training_demo/model_main_tf2.py", line 126, in <module>
    tf.compat.v1.app.run()
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/platform/app.py", line 36, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/Users/_dga/ml-git/tf-ark/Tensorflow/workspace/training_demo/model_main_tf2.py", line 117, in main
    model_lib_v2.train_loop(
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 605, in train_loop
    load_fine_tune_checkpoint(
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 401, in load_fine_tune_checkpoint
    _ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 161, in _ensure_model_is_built
    features, labels = iter(input_dataset).next()
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 260, in next
    return self.__next__()
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 264, in __next__
    return self.get_next()
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 325, in get_next
    return self._get_next_no_partial_batch_handling(name)
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 361, in _get_next_no_partial_batch_handling
    replicas.extend(self._iterators[i].get_next_as_list(new_name))
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 1427, in get_next_as_list
    return self._format_data_list_with_options(self._iterator.get_next())
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py", line 553, in get_next
    result.append(self._device_iterators[i].get_next())
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 867, in get_next
    return self._next_internal()
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 777, in _next_internal
    ret = gen_dataset_ops.iterator_get_next(
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 3028, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 6656, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__IteratorGetNext_output_types_18_device_/job:localhost/replica:0/task:0/device:GPU:0}} indices[0] = 0 is not in [0, 0)
     [[{{node GatherV2_7}}]]
     [[MultiDeviceIteratorGetNextFromShard]]
     [[RemoteCall]] [Op:IteratorGetNext] name:
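
The last line of the error is TensorFlow's index-range check for `GatherV2`: every index must lie in the half-open range `[0, limit)`, where `limit` is the size of the gathered axis. `[0, 0)` means that axis was empty, so even index 0 is rejected. The traceback also shows the failure inside `iter(input_dataset).next()` in `_ensure_model_is_built`, i.e. while fetching the first batch from the input pipeline, before any training step runs. The plain-Python sketch below only mimics that check to make the message readable; it is not TensorFlow code, and the `gather`/`first_out_of_range` helpers and sample values are hypothetical.

```python
def first_out_of_range(indices, limit):
    """Return a TensorFlow-style message for the first bad index, or None."""
    for i, idx in enumerate(indices):
        # Valid indices lie in the half-open range [0, limit).
        if not 0 <= idx < limit:
            return f"indices[{i}] = {idx} is not in [0, {limit})"
    return None


def gather(params, indices):
    """Mimic tf.gather along axis 0, including its index-range check."""
    error = first_out_of_range(indices, len(params))
    if error is not None:
        raise IndexError(error)
    return [params[idx] for idx in indices]


if __name__ == "__main__":
    print(gather(["box_a", "box_b"], [1, 0]))  # ['box_b', 'box_a']
    try:
        # A zero-length axis (hypothetical: a per-object tensor with no
        # entries) makes the valid range [0, 0), which rejects index 0.
        gather([], [0])
    except IndexError as err:
        print(err)  # indices[0] = 0 is not in [0, 0)
```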


  [1]: https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py

Running the setup verification script available on Apple's tensorflow-metal page, i.e.

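```python
import tensorflow as tf

# Smoke test: train ResNet50 from scratch on CIFAR-100 to confirm the GPU works.
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,  # random initialization, no pretrained weights
    input_shape=(32, 32, 3),
    classes=100,
)

# Labels are integer class ids and the model ends in softmax, hence from_logits=False.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)
```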

works fine, i.e. it detects the device, etc.
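
To check the same device detection from Python without running a full training job, `tf.config.list_physical_devices` can be queried directly. This sketch is an illustration, not part of the original post: the `list_gpus` helper name is hypothetical, and the `find_spec` guard only lets the snippet run on a machine where TensorFlow is not installed.

```python
import importlib.util


def list_gpus():
    """Return the GPUs TensorFlow can see ([] if TensorFlow is not installed)."""
    if importlib.util.find_spec("tensorflow") is None:
        return []
    import tensorflow as tf

    return tf.config.list_physical_devices("GPU")


if __name__ == "__main__":
    print(list_gpus())
```

On a working tensorflow-metal install the returned list should contain one entry whose `device_type` is `'GPU'`, matching the `device:GPU:0 ... name: METAL` lines in the log.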

Recommended answers

This answer/assumption also seems to be incorrect. Training the same model on an Ubuntu machine, with either GPU or CPU, also fails with the identical error.
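
On the Mac itself, a similar CPU-versus-GPU comparison could be made by hiding the Metal GPU from TensorFlow before anything initializes it; with no visible GPU, `tf.distribute.MirroredStrategy` falls back to the CPU. This is a sketch the original post did not try: the `force_cpu_only` helper is hypothetical, and calling it at the top of `main()` in model_main_tf2.py (before the strategy is created) is an assumed placement.

```python
import importlib.util


def force_cpu_only():
    """Hide all GPUs from TensorFlow so it falls back to the CPU.

    Must run before TensorFlow initializes its runtime (i.e. before any
    model, dataset, or strategy is built); afterwards set_visible_devices
    raises RuntimeError. Returns the device types TensorFlow can still
    see, or None if TensorFlow is not installed.
    """
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf

    tf.config.set_visible_devices([], "GPU")
    return [d.device_type for d in tf.config.get_visible_devices()]


if __name__ == "__main__":
    print(force_cpu_only())
```

If the same `indices[0] = 0 is not in [0, 0)` error still appears with only the CPU visible, the Metal plugin is unlikely to be the cause, which would agree with the Ubuntu result.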

I found that this problem has been reported in a GitHub issue since 2020.

For future reference, for myself and others:

On the same machine I could successfully move ahead with training other categories of models, and I couldn't find any specific explanation of why this error shows up for this particular model type, i.e. mask_rcnn_inception_resnet.

Thus I concluded that, since this model is not yet supported on TPUs, it cannot run on the Mac M2: although the M2 is called a GPU, TF possibly sees it as a TPU because of the pluggable-device pattern used by tensorflow-metal.

Further update: I managed to get hold of someone from the official TensorFlow team, and the answer is that research models (i.e. the Tensorflow/models/research section) are not supported; we are expected to use the official models.

Working Mac M1 gist for TF2 Object detection
