
python - Shuffle order of DataLoader in pytorch


I am really confused about the shuffle order of DataLoader in PyTorch. Suppose I have a dataset:

import torch
from torch.utils.data import DataLoader, RandomSampler

datasets = [0, 1, 2, 3, 4]

Scenario I, the code is:

torch.manual_seed(1)

G = torch.Generator()
G.manual_seed(1)

ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler)

The shuffle result is 0, 4, 2, 3, 1.


Scenario II, the code is:

torch.manual_seed(1)

G = torch.Generator()
G.manual_seed(1)

ran_sampler = RandomSampler(data_source=datasets)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)

The shuffle result is 1, 3, 4, 0, 2.


Scenario III, the code is:

torch.manual_seed(1)

G = torch.Generator()
G.manual_seed(1)

ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)

The shuffle result is 4, 1, 3, 0, 2.

Can anyone explain what is going on here?

Best Answer

Based on your code, I made a small modification (in scenario II) and checked the results:

import torch
from torch.utils.data import DataLoader, RandomSampler

datasets = [0, 1, 2, 3, 4]

torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)

# Scenario I: the sampler gets G, the DataLoader itself does not.
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler)
print(id(dataloader.generator) == id(dataloader.sampler.generator))
xs = []
for x in dataloader:
    xs.append(x.item())
print(xs)

torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)

# This differs from the OP's scenario II, because there the ran_sampler is not
# initialized with the right generator; here shuffle=True lets DataLoader
# build the RandomSampler itself.
dataloader = DataLoader(dataset=datasets, shuffle=True, generator=G)
print(id(dataloader.generator) == id(dataloader.sampler.generator))
xs = []
for x in dataloader:
    xs.append(x.item())
print(xs)

torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)

# Scenario III: both the sampler and the DataLoader receive the same G.
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)
print(id(dataloader.generator) == id(dataloader.sampler.generator))
xs = []
for x in dataloader:
    xs.append(x.item())
print(xs)

The output is:

False
[0, 4, 2, 3, 1]
True
[4, 1, 3, 0, 2]
True
[4, 1, 3, 0, 2]

The reason these three seemingly equivalent setups lead to different results is that DataLoader actually uses two different generators internally, and in the first scenario one of them is None.
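As a quick check of this claim, here is a minimal sketch inspecting both generators in scenario I; it relies only on the sampler and generator attributes already used above:

import torch
from torch.utils.data import DataLoader, RandomSampler

datasets = [0, 1, 2, 3, 4]

torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)

# Scenario I: the sampler holds G, but the DataLoader's own generator stays None.
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler)
print(dataloader.sampler.generator is G)  # True
print(dataloader.generator is None)       # True: this is the second, unset generator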

To make this clear, let's look at the source. It turns out that generator not only determines the random number generation of the DataLoader's internal _index_sampler, but also affects the initialization of _BaseDataLoaderIter. From the source code:

        if sampler is None:  # give default samplers
            if self._dataset_kind == _DatasetKind.Iterable:
                # See NOTE [ Custom Samplers and IterableDataset ]
                sampler = _InfiniteConstantSampler()
            else:  # map-style
                if shuffle:
                    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
                else:
                    sampler = SequentialSampler(dataset)  # type: ignore[arg-type]

        self.sampler = sampler
        self.batch_sampler = batch_sampler
        self.generator = generator

    def _get_iterator(self) -> '_BaseDataLoaderIter':
        if self.num_workers == 0:
            return _SingleProcessDataLoaderIter(self)
        else:
            self.check_worker_number_rationality()
            return _MultiProcessingDataLoaderIter(self)

class _BaseDataLoaderIter(object):
    def __init__(self, loader: DataLoader) -> None:
        ...
        self._index_sampler = loader._index_sampler
  • Scenario II and Scenario III

The two settings are equivalent: we pass the generator to DataLoader and do not specify a sampler. DataLoader then uses generator to automatically create a RandomSampler object, and assigns the same generator to self.generator, as the sketch below confirms.
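A minimal sketch of that equivalence, again using only the attributes inspected above:

import torch
from torch.utils.data import DataLoader

datasets = [0, 1, 2, 3, 4]

G = torch.Generator()
G.manual_seed(1)

# shuffle=True makes DataLoader build its own RandomSampler with this generator.
dataloader = DataLoader(dataset=datasets, shuffle=True, generator=G)
print(type(dataloader.sampler).__name__)  # RandomSampler
print(dataloader.sampler.generator is G)  # True: same object in both places
print(dataloader.generator is G)          # True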

  • Scenario I

We pass DataLoader a sampler that holds the right generator, but do not specify generator in DataLoader.__init__(...). DataLoader keeps the given sampler, yet leaves self.generator at its default of None, and that None is what the _BaseDataLoaderIter object returned by self._get_iterator() sees. This matters because the iterator draws its base seed from loader.generator: in scenario III that draw consumes one random number from G before the sampler shuffles, while in scenario I it falls back to the global RNG and leaves G untouched, so the two scenarios produce different orders even though both samplers hold the same G. A sketch of this is shown below.
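As a sketch of this mechanism, under the assumption that the iterator consumes exactly one int64 draw from loader.generator for its base seed, manually burning one draw from G in the scenario I setup should reproduce scenario III's order:

import torch
from torch.utils.data import DataLoader, RandomSampler

datasets = [0, 1, 2, 3, 4]

torch.manual_seed(1)
G = torch.Generator()
G.manual_seed(1)

# Mimic the iterator's base-seed draw that scenario III performs on G.
# (Assumption: _BaseDataLoaderIter draws one int64 from loader.generator.)
torch.empty((), dtype=torch.int64).random_(generator=G)

# Scenario I setup, otherwise unchanged.
ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler)
print([x.item() for x in dataloader])  # expected: [4, 1, 3, 0, 2], as in scenario III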

Regarding python - Shuffle order of DataLoader in pytorch, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/74580942/
