
Dataloader/sampler/collator to create batches based on the sample contents (sequence length)




I am converting someone else's code into a neater torch-y pipeline, using datasets and dataloaders, collate functions and samplers. While I have done such work before, I am not sure how to tackle the following problem.



The dataset contains sentences as samples. Every sample therefore has a number of words (or tokens), which we can get by naively splitting the sample on white space (sample.split()). Such a dummy dataset can look like this:



from random import randint

from torch.utils.data import Dataset


class DummyDataset(Dataset):
    def __init__(self):
        # Each sample is a "hello hello ..." sentence with 64-176 tokens
        data = []
        for _ in range(128):
            data.append("hello " * randint(64, 176))
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx: int):
        return self.data[idx]

Now I want to be able to load the data such that the maximum number of tokens in a batch does not exceed 250. That implies that the batch size can differ between iterations: one batch may contain two samples that together have no more than 250 tokens (for instance 127 + 77), and another may have three (66 + 66 + 66). The core functionality for this is rather straightforward. Full example below; it is not optimized by sorting on length or similar, but that's okay for this example (a rough sketch of a length-sorted variant follows the code).



The question is, how can I integrate this into the PyTorch ecosystem? Batch sizes are so often used to indicate the number of samples (like in the dataloader). So where should I plug this in, or what should I subclass, to make this work like a regular dataloader?



from random import randint

from torch.utils.data import Dataset


class DummyDataset(Dataset):
    def __init__(self):
        data = []
        for _ in range(128):
            data.append("hello " * randint(64, 176))
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx: int):
        return self.data[idx]


if __name__ == '__main__':
    dataset = DummyDataset()

    def get_batch(max_tokens: int = 250):
        data_idxs = list(range(len(dataset)))

        batch = []
        total_batch_len = 0
        while data_idxs:
            sample = dataset[data_idxs[0]]
            sample_len = len(sample.split())

            if total_batch_len + sample_len <= max_tokens:
                batch.append(sample)
                total_batch_len += sample_len
                data_idxs.pop(0)
            elif batch:
                yield batch
                batch = []
                total_batch_len = 0

        yield batch

    # Sanity check that we indeed get all items from the dataset
    num_samples = 0
    num_batches = 0
    for b in get_batch():
        num_samples += len(b)
        num_batches += 1

    print(f"Created {num_batches} batches")
    assert num_samples == len(dataset)
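
For completeness, a rough sketch of that length-sorted variant (untested; get_length_sorted_batches is just an illustrative name, and it yields lists of indices rather than the samples themselves):

def get_length_sorted_batches(dataset, max_tokens: int = 250):
    # Pair every index with its naive whitespace token count
    lengths = [(idx, len(dataset[idx].split())) for idx in range(len(dataset))]
    # Sort by length so similarly sized samples end up in the same batch
    lengths.sort(key=lambda pair: pair[1])

    batch, total = [], 0
    for idx, n_tokens in lengths:
        # Start a new batch once adding this sample would exceed the budget
        if batch and total + n_tokens > max_tokens:
            yield batch
            batch, total = [], 0
        batch.append(idx)
        total += n_tokens
    if batch:
        yield batch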

Maybe torchtext's Iterator and its batch_size_fn can help, but I have no experience with it (where should I add it; is it a dataloader itself, or should I still wrap a dataloader around it, etc.).



Recommended answer:

After reading some source code, it seems that you can use any iterable as a DataLoader's batch_sampler. So the following works as expected.



from random import randint

from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader


class DummyDataset(Dataset):
    def __init__(self):
        data = []
        for _ in range(128):
            data.append("hello " * randint(64, 176))
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx: int):
        return self.data[idx]


class TokenBatchSampler:
    """Yields lists of sample indices whose total (whitespace-split) token
    count does not exceed max_tokens."""

    def __init__(self, dataset: Dataset, max_tokens: int = 250):
        self.dataset = dataset
        self.max_tokens = max_tokens
        self.batches = []
        self._prepare_dataset()

    def __len__(self) -> int:
        return len(self.batches)

    def __iter__(self):
        return iter(self.batches)

    def _prepare_dataset(self):
        data_idxs = list(range(len(self.dataset)))

        batches = []
        batch_idxs = []
        total_batch_len = 0
        while data_idxs:
            sample_idx = data_idxs[0]
            sample = self.dataset[sample_idx]
            sample_len = len(sample.split())

            if total_batch_len + sample_len <= self.max_tokens:
                batch_idxs.append(sample_idx)
                total_batch_len += sample_len
                data_idxs.pop(0)
            elif batch_idxs:
                batches.append(batch_idxs)
                batch_idxs = []
                total_batch_len = 0

        # Don't forget the last, partially filled batch
        batches.append(batch_idxs)

        self.batches = batches


if __name__ == "__main__":
    dataset = DummyDataset()

    sampler = TokenBatchSampler(dataset)
    dataloader = DataLoader(dataset, batch_sampler=sampler)
    # Sanity check that we indeed get all items from the dataset
    for epoch in range(3):
        num_samples = 0
        num_batches = 0
        for b in dataloader:
            num_samples += len(b)
            num_batches += 1

        print(f"Created {num_batches} batches in epoch {epoch}")
        assert num_samples == len(dataset)

    print(f"DataLoader length {len(dataloader)}")


