
tensorflow - keras.utils.Sequence with multiple files

Reposted · Author: 行者123 · Updated: 2023-12-02 02:44:55

I understand how to use keras.utils.Sequence with a single data file: you subclass keras.utils.Sequence and implement its interface, __len__ and __getitem__.

For example:

def __len__(self):
    "Denotes the number of batches per epoch"
    return int(np.ceil(self.no_examples / float(self.batch_size)))

def __getitem__(self, idx):
    # build and return the batch selected by idx and self.batch_size
    ...
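To make the interface concrete, here is a minimal runnable sketch for the single-file case. The class name DataSequence and the in-memory x/y arrays are illustrative assumptions, not part of the original question:

import numpy as np
from tensorflow import keras

class DataSequence(keras.utils.Sequence):  # illustrative name, not from the question
    def __init__(self, x, y, batch_size):
        # x and y are assumed to be arrays already loaded from the single file
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.no_examples = len(x)

    def __len__(self):
        "Denotes the number of batches per epoch"
        return int(np.ceil(self.no_examples / float(self.batch_size)))

    def __getitem__(self, idx):
        # Slice out batch number idx; the final batch may be shorter.
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        return self.x[lo:hi], self.y[lo:hi]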

But what if your data is spread across multiple files? For example:
  • train_part1.csv
  • train_part2.csv
  • train_partn.csv

  • How can you iterate over all of the batches with just a single pointer idx?

Best Answer

You can set up a mapping of (range, file_path):

import functools

import numpy as np
from tensorflow import keras

class MultiFileSequence(keras.utils.Sequence):  # illustrative class name
    def __init__(self, file_paths, batch_size):
        self.batch_size = batch_size
        self._mapping = dict()
        count = 0
        for file_path in file_paths:
            with open(file_path, 'r') as f:
                size = len(f.readlines())
            # map the half-open example range [count, count + size) to its file
            self._mapping[(count, count + size)] = file_path
            count += size
        self.no_examples = count

    def _find_file_path(self, idx):
        # Linear scan over the ranges; O(n) in the number of files.
        for (start, end), file_path in self._mapping.items():
            if start <= idx < end:
                in_file_idx = idx - start
                return (in_file_idx, file_path)
        raise IndexError(idx)

    def __len__(self):
        "Denotes the number of batches per epoch"
        return int(np.ceil(self.no_examples / float(self.batch_size)))

    @functools.lru_cache(maxsize=128)  # memoize file contents for caching
    def _read_file_data(self, file_path):
        with open(file_path, 'r') as f:
            return list(f.readlines())

    def __getitem__(self, idx):
        # Keras passes a batch index; translate it into example indices.
        start = idx * self.batch_size
        stop = min(start + self.batch_size, self.no_examples)
        batch = []
        for example_idx in range(start, stop):
            in_file_idx, file_path = self._find_file_path(example_idx)
            lines = self._read_file_data(file_path)
            batch.append(lines[in_file_idx])
        return batch
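A usage sketch, assuming the class above and placeholder file names:

seq = MultiFileSequence(['train_part1.csv', 'train_part2.csv'], batch_size=32)  # placeholder paths
print(len(seq))       # number of batches per epoch
first_batch = seq[0]  # list of raw CSV lines making up batch 0
# The raw lines would still need parsing into (x, y) arrays before the
# Sequence can be handed to model.fit(seq, ...).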

Further optimizations:
  • Track memory consumption and evict cached file contents if the files are too large to all fit in memory at once;
  • If there are many files, implement a more efficient _find_file_path; the current implementation is O(n) in the number of files (see the sketch after this list);
  • Regarding tensorflow - keras.utils.Sequence with multiple files, there is a similar question on Stack Overflow: https://stackoverflow.com/questions/55912157/
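For the second point, one way to get an O(log n) lookup is to keep the cumulative start offsets in a sorted list and binary-search them with the standard-library bisect module. This is a sketch of that idea, not part of the accepted answer:

import bisect

def __init__(self, file_paths, batch_size):
    self.batch_size = batch_size
    self._starts, self._paths = [], []   # parallel lists, sorted by start offset
    count = 0
    for file_path in file_paths:
        with open(file_path, 'r') as f:
            size = len(f.readlines())
        self._starts.append(count)       # first example index held by this file
        self._paths.append(file_path)
        count += size
    self.no_examples = count

def _find_file_path(self, idx):
    if not 0 <= idx < self.no_examples:
        raise IndexError(idx)
    # Rightmost file whose start offset is <= idx: O(log n) instead of O(n).
    i = bisect.bisect_right(self._starts, idx) - 1
    return idx - self._starts[i], self._paths[i]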
