tensorflow - Does the TensorFlow Estimator take different batches for the workers when using MirroredStrategy?


I am using GANEstimator with MirroredStrategy to work on the multiple GPUs of a single instance. The input_fn in my case is a tf.data.Dataset with the following settings:

dataset = dataset.repeat()
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(self.batch_size, drop_remainder=True)
dataset = dataset.prefetch(100)

The reason I am asking this: do I need to specify something like dataset.shard() manually to pass different data to the workers? I was digging into the code of Estimator and MirroredStrategy, but it is unclear to me what is going on. Additional confusion comes from the description of distributed strategies:
MirroredStrategy: This does in-graph replication with synchronous 
training on many GPUs on one machine. Essentially, we create copies of all
variables in the model's layers on each device. We then use all-reduce
to combine gradients across the devices before applying them
to the variables to keep them in sync.

CollectiveAllReduceStrategy: This is a version of MirroredStrategy
for multi-worker training.

So, does MirroredStrategy use only one worker? I don't understand it. I need to specify a batch size equal to the capacity of one tower, otherwise I get OOM. Can someone please point me to the code and explain how such a simple setup handles batching:
def create_dataset():
    ...
    dataset = dataset.repeat()
    dataset = dataset.shuffle(buffer_size=100)
    dataset = dataset.batch(self.batch_size, drop_remainder=True)
    dataset = dataset.prefetch(100)
    return dataset



NUM_GPUS = 4
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)

optimizer = tf.train.RMSPropOptimizer(learning_rate=0.01, use_locking=True)
optimizer_d = tf.train.RMSPropOptimizer(learning_rate=0.01, use_locking=True)

config = tf.estimator.RunConfig(save_checkpoints_steps=100,
                                save_summary_steps=1, keep_checkpoint_max=50,
                                train_distribute=strategy)

# I have more hooks here, just simplified to show
def get_hooks_fn(GANTrainOps):

    disjoint_train_hook_func = tfgan.get_sequential_train_hooks(
        train_steps=tfgan.GANTrainSteps(10, 1)
    )  # g steps, d steps
    disjoint_train_hooks = disjoint_train_hook_func(GANTrainOps)
    return [update_hook, summary_hook] + disjoint_train_hooks


# Create GAN estimator.
gan_estimator = tfgan.estimator.GANEstimator(
    model_dir = '/data/checkpoints/estimator_model',
    generator_fn = generator_fn,
    discriminator_fn = discriminator_fn,
    generator_loss_fn = generator_loss_fn,
    discriminator_loss_fn = discriminator_loss_fn,
    generator_optimizer = optimizer,
    discriminator_optimizer = optimizer_d,
    use_loss_summaries=True,
    config=config,
    get_hooks_fn=get_hooks_fn)


gan_estimator.train(input_fn=create_dataset, steps=10000)

Thanks!

The code of MirroredStrategy contains:

1) strange wording:

The multi-worker version of this class maps one replica to one device on a worker. It mirrors all model variables on all replicas. For example, if you have two workers and each worker has 4 GPUs, it will create 8 copies of the model variables on these 8 GPUs. Then like in MirroredStrategy(???), each replica performs their computation with their own copy of variables unless in cross-replica model where variable or tensor reduction happens.



2)

auto_shard_dataset: whether to auto-shard the dataset when there are multiple workers.



This parameter is False by default.
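(For reference: if the data ever did have to be split across workers by hand, e.g. in a true multi-worker setup with auto-sharding left off, a sketch of what that could look like with tf.data.Dataset.shard() is below. The file pattern, worker count and worker index are hypothetical, and none of this is needed for single-machine MirroredStrategy.)

    # Hedged sketch: manual per-worker sharding with tf.data (hypothetical paths
    # and worker indices; single-machine MirroredStrategy does not need this).
    import tensorflow as tf

    def create_sharded_dataset(num_workers, worker_index, batch_size):
        files = tf.data.Dataset.list_files("/data/train-*.tfrecord")  # hypothetical file pattern
        files = files.shard(num_workers, worker_index)  # each worker keeps every num_workers-th file
        dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)
        dataset = dataset.repeat()
        dataset = dataset.shuffle(buffer_size=100)
        dataset = dataset.batch(batch_size, drop_remainder=True)
        dataset = dataset.prefetch(100)
        return dataset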

EDIT:

So far I have found that tf.estimator.train() after a while points to what seems to be strategy.make_input_fn_iterator():
def _get_iterator_from_input_fn(self, input_fn, mode, distribution=None):
  if distribution is not None:
    iterator = distribution.make_input_fn_iterator(
        lambda _: self._call_input_fn(input_fn, mode))
    input_hooks = [
        estimator_util.DistributedIteratorInitializerHook(iterator)]
  else:
    result = self._call_input_fn(input_fn, mode)
    iterator = result.make_initializable_iterator()
    input_hooks = [estimator_util._DatasetInitializerHook(iterator)]
  return iterator, input_hooks
However, make_input_fn_iterator() has been removed from the code of MirroredStrategy and no longer exists there! I don't understand how this works and where the dataset is actually split.

EDIT2: I cannot find the line make_input_fn_iterator in my tensorflow 1.12.0 distribution using grep. It seems to be completely absent from the code.

Best Answer

Ok, after spending some time digging through GitHub, I found that the code there already differs from my tf 1.12.0. So, going into the local files of 1.12.0 gave me:

GANEstimator inherits from tf.python.estimator.Estimator

Estimator.__init__():

# The distribute field contains an instance of DistributionStrategy.
self._train_distribution = self._config.train_distribute

Then the path downwards is:
tf.contrib.gan.GANEstimator -> tf.python.estimator.Estimator.train() --> 
tf.python.estimator.Estimator._train_model(input_fn, hooks, saving_listeners) -->
._train_model_distributed(input_fn, hooks, saving_listeners) -->
._get_iterator_from_input_fn(input_fn, model_fn_lib.ModeKeys.TRAIN, self._train_distribution) -->
distribution.distribute_dataset(lambda: self._call_input_fn(input_fn, mode))

In my case it calls MirroredStrategy.distribute_dataset():
def distribute_dataset(self, dataset_fn):
  if self._cluster_spec:
    return values.MultiWorkerDataset(
        partial(self._call_dataset_fn, dataset_fn), self._worker_device_map,
        self._prefetch_on_device, self._auto_shard_dataset)
  else:
    return values.PerDeviceDataset(
        self._call_dataset_fn(dataset_fn), self._devices,
        self._prefetch_on_device)
tensorflow/python/training/distribute.py:

  def _call_dataset_fn(self, dataset_fn):
    result = dataset_fn()
    if not isinstance(result, dataset_ops.Dataset):
      raise ValueError(
          "dataset_fn() must return a tf.data.Dataset when using a "
          "DistributionStrategy.")
    return result
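In other words, once a DistributionStrategy is configured, the input_fn has to return the tf.data.Dataset itself; returning an iterator or already-extracted tensors trips the ValueError above. A minimal sketch of the two cases (the toy tensors are made up for illustration):

    # Minimal sketch of what _call_dataset_fn accepts vs. rejects (toy data only).
    import tensorflow as tf

    def good_input_fn():
        ds = tf.data.Dataset.from_tensor_slices(tf.zeros([128, 10]))
        return ds.repeat().batch(32, drop_remainder=True)  # a tf.data.Dataset: accepted

    def bad_input_fn():
        ds = tf.data.Dataset.from_tensor_slices(tf.zeros([128, 10]))
        ds = ds.repeat().batch(32, drop_remainder=True)
        return ds.make_one_shot_iterator().get_next()  # a tensor, not a Dataset: raises the ValueError above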


I assume that PerDeviceDataset is used here, so eventually I found these two classes in values.py:
class PerDeviceDataset(object):
  """Like `tf.data.Dataset` split devices, producing `PerDevice` data."""

  def __init__(self, dataset, devices, prefetch_on_device=None):
    self._devices = devices

    # Default to using prefetching in graph mode, unless specified.
    # TODO(priyag): Enable prefetching in eager mode.
    self._prefetch_on_device = prefetch_on_device
    if self._prefetch_on_device is None:
      self._prefetch_on_device = not context.executing_eagerly()
    assert not (self._prefetch_on_device and context.executing_eagerly()), (
        "Prefetching is only supported in graph mode currently")

    if self._prefetch_on_device:
      self._dataset = dataset.apply(
          prefetching_ops_v2.prefetch_to_devices(self._devices))
    else:
      # TODO(priyag): If dropping remainder is not appropriate, find another
      # approach to distributing the dataset when not possible to divide evenly.
      # Possibly not an issue when we start using PartitionedDataset.
      self._dataset = dataset.batch(len(devices), drop_remainder=True)

  def make_one_shot_iterator(self):
    """Get a one time use iterator for the distributed PerDeviceDataset."""
    dataset_iterator = self._dataset.make_one_shot_iterator()
    return PerDeviceDataIterator(dataset_iterator, self._devices,
                                 self._prefetch_on_device)

  def make_initializable_iterator(self):
    """Get an initializable iterator for the distributed PerDeviceDataset."""
    dataset_iterator = self._dataset.make_initializable_iterator()
    return PerDeviceDataIterator(dataset_iterator, self._devices,
                                 self._prefetch_on_device)


class PerDeviceDataIterator(object):
  """An iterator (like `tf.data.Iterator`) into a `PerDeviceDataset`."""

  def __init__(self, iterator, devices, prefetch_on_device=None):
    self._iterator = iterator
    self._devices = devices
    self._prefetch_on_device = prefetch_on_device

  @property
  def initializer(self):
    return self._iterator.initializer

  def get_next(self, name=None):
    """Scatter the input across devices."""
    if self._prefetch_on_device:
      data_list = self._iterator.get_next(name=name)
      index = dict(zip(self._devices, data_list))
    else:
      batch = self._iterator.get_next(name=name)
      index = {}
      def get_ith(i):
        return lambda x: x[i]

      for i, d in enumerate(self._devices):
        index[d] = nest.map_structure(get_ith(i), batch)
        if context.executing_eagerly():
          with ops.device(d):
            index[d] = nest.map_structure(array_ops.identity, index[d])

    return regroup(index)

So, as far as I understand, my dataset_fn() is first called simply to obtain the dataset object, and then a batch of size equal to the number of GPUs is applied on top of it. The elements of this outer batch, which must be the actual batches defined in my dataset initialization inside dataset_fn(), are then assigned to the different devices.
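To make this concrete, here is a small self-contained sketch: not the Estimator internals, just the same double-batching idea reproduced in plain tf.data (graph mode, TF 1.x; the numbers are made up). It shows that each device receives one inner batch of batch_size elements, so the effective global batch is batch_size * NUM_GPUS:

    # Hedged sketch of the double-batching that PerDeviceDataset performs
    # (illustration only; batch_size and num_gpus are made-up values).
    import numpy as np
    import tensorflow as tf

    batch_size = 8   # per-tower batch, as set in input_fn
    num_gpus = 4     # NUM_GPUS from the question

    dataset = tf.data.Dataset.from_tensor_slices(np.arange(1024, dtype=np.float32))
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size, drop_remainder=True)  # what my input_fn does
    dataset = dataset.batch(num_gpus, drop_remainder=True)    # what PerDeviceDataset adds on top

    iterator = dataset.make_one_shot_iterator()
    outer_batch = iterator.get_next()                        # shape: (num_gpus, batch_size)
    per_device = [outer_batch[i] for i in range(num_gpus)]   # replica i receives outer_batch[i]

    with tf.Session() as sess:
        shards = sess.run(per_device)
        for i, shard in enumerate(shards):
            print("device %d gets a batch of shape %s" % (i, shard.shape))  # (8,) each

This is also consistent with the OOM behaviour from the question: the memory footprint per tower is set by batch_size, while the effective training batch is num_gpus times larger.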

Regarding "tensorflow - Does the TensorFlow Estimator take different batches for the workers when using MirroredStrategy?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54327610/
