
Dataloader/sampler/collator to create batches based on the sample contents (sequence length)




I am converting someone else's code into a neater torch-y pipeline, using datasets and dataloaders, collate functions and samplers. While I have done such work before, I am not sure how to tackle the following problem.



The dataset contains sentences as samples. Every sample therefore has a number of words (or tokens), which we can get by naively splitting the sample on white space (sample.split()). Such a dummy dataset can look like this:



from random import randint

from torch.utils.data import Dataset


class DummyDataset(Dataset):
    def __init__(self):
        # Each sample is a "hello hello ..." sentence with 64-176 tokens
        data = []
        for _ in range(128):
            data.append("hello " * randint(64, 176))
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx: int):
        return self.data[idx]

Now I want to be able to load the data such that the maximum number of tokens in a batch does not exceed 250. That implies that the batch size can differ between iterations: one batch may contain two samples that together have no more than 250 tokens (for instance 127 + 77), and another may have three (66 + 66 + 66). The core functionality for this is rather straightforward. Full example below; it is not optimized by sorting on length or similar, but that's okay for this example (a rough sketch of a length-sorted variant follows the code).



The question is, how can I integrate this into the PyTorch ecosystem? Batch sizes are so often used to indicate the number of samples (like in the dataloader). So where should I plug this in, or what should I subclass, to make this work like a regular dataloader?



from random import randint

from torch.utils.data import Dataset


class DummyDataset(Dataset):
    def __init__(self):
        data = []
        for _ in range(128):
            data.append("hello " * randint(64, 176))
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx: int):
        return self.data[idx]


if __name__ == '__main__':
    dataset = DummyDataset()

    def get_batch(max_tokens: int = 250):
        data_idxs = list(range(len(dataset)))

        batch = []
        total_batch_len = 0
        while data_idxs:
            sample = dataset[data_idxs[0]]
            sample_len = len(sample.split())

            if total_batch_len + sample_len <= max_tokens:
                batch.append(sample)
                total_batch_len += sample_len
                data_idxs.pop(0)
            elif batch:
                yield batch
                batch = []
                total_batch_len = 0

        yield batch

    # Sanity check that we indeed get all items from the dataset
    num_samples = 0
    num_batches = 0
    for b in get_batch():
        num_samples += len(b)
        num_batches += 1

    print(f"Created {num_batches} batches")
    assert num_samples == len(dataset)
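
For completeness, a rough sketch of that length-sorted variant (untested; get_length_sorted_batches is just an illustrative name, and it yields lists of indices rather than the samples themselves):

def get_length_sorted_batches(dataset, max_tokens: int = 250):
    # Pair every index with its naive whitespace token count
    lengths = [(idx, len(dataset[idx].split())) for idx in range(len(dataset))]
    # Sort by length so similarly sized samples end up in the same batch
    lengths.sort(key=lambda pair: pair[1])

    batch, total = [], 0
    for idx, n_tokens in lengths:
        # Start a new batch once adding this sample would exceed the budget
        if batch and total + n_tokens > max_tokens:
            yield batch
            batch, total = [], 0
        batch.append(idx)
        total += n_tokens
    if batch:
        yield batch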

Maybe torchtext's Iterator and its batch_size_fn can help, but I have no experience with it (where should I add it; is it a dataloader itself, or should I still wrap a dataloader around it, etc.).



Recommended answer:

After reading some source code, it seems that you can use any iterable as a DataLoader's batch_sampler. So the following works as expected.



from random import randint

from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader


class DummyDataset(Dataset):
    def __init__(self):
        data = []
        for _ in range(128):
            data.append("hello " * randint(64, 176))
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx: int):
        return self.data[idx]


class TokenBatchSampler:
    """Yields lists of sample indices whose total (whitespace-split) token
    count does not exceed max_tokens."""

    def __init__(self, dataset: Dataset, max_tokens: int = 250):
        self.dataset = dataset
        self.max_tokens = max_tokens
        self.batches = []
        self._prepare_dataset()

    def __len__(self) -> int:
        return len(self.batches)

    def __iter__(self):
        return iter(self.batches)

    def _prepare_dataset(self):
        data_idxs = list(range(len(self.dataset)))

        batches = []
        batch_idxs = []
        total_batch_len = 0
        while data_idxs:
            sample_idx = data_idxs[0]
            sample = self.dataset[sample_idx]
            sample_len = len(sample.split())

            if total_batch_len + sample_len <= self.max_tokens:
                batch_idxs.append(sample_idx)
                total_batch_len += sample_len
                data_idxs.pop(0)
            elif batch_idxs:
                batches.append(batch_idxs)
                batch_idxs = []
                total_batch_len = 0

        # Don't forget the last, partially filled batch
        batches.append(batch_idxs)

        self.batches = batches


if __name__ == "__main__":
    dataset = DummyDataset()

    sampler = TokenBatchSampler(dataset)
    dataloader = DataLoader(dataset, batch_sampler=sampler)
    # Sanity check that we indeed get all items from the dataset
    for epoch in range(3):
        num_samples = 0
        num_batches = 0
        for b in dataloader:
            num_samples += len(b)
            num_batches += 1

        print(f"Created {num_batches} batches in epoch {epoch}")
        assert num_samples == len(dataset)

    print(f"DataLoader length {len(dataloader)}")


