
python - Shuffling an HDF5 dataset with h5py

Reposted · Author: 太空宇宙 · Updated: 2023-11-03 11:48:21

I have a large HDF5 file (~30 GB) and I need to shuffle the entries of each of its datasets (along axis 0). Looking through the h5py docs, I wasn't able to find either random-access or shuffle functionality, but I'm hoping I've missed something.

Is anyone familiar enough with HDF5 to think of a fast way to randomly shuffle the data?

Here is pseudocode of what, with my limited knowledge, I would implement:

for dataset in datasets:
    unshuffled = range(dataset.dims[0])
    while unshuffled.length != 0:
        if unshuffled.length <= 100:
            dataset[:unshuffled.length/2], dataset[unshuffled.length/2:] = \
                dataset[unshuffled.length/2:], dataset[:unshuffled.length/2]
            break
        else:
            randomIndex1 = rand(unshuffled.length - 100)
            randomIndex2 = rand(unshuffled.length - 100)

            unshuffled.removeRange(randomIndex1..<randomIndex1 + 100)
            unshuffled.removeRange(randomIndex2..<randomIndex2 + 100)

            dataset[randomIndex1:randomIndex1 + 100], dataset[randomIndex2:randomIndex2 + 100] = \
                dataset[randomIndex2:randomIndex2 + 100], dataset[randomIndex1:randomIndex1 + 100]
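As a sanity check of the block-swap idea in the pseudocode, here is a runnable sketch (mine, not the asker's; `block_swap_shuffle` is a hypothetical name). A plain list stands in for the dataset, but anything supporting equal-length slice reads and writes along axis 0, including an h5py dataset, would work the same way. Note this gives only an approximate shuffle unless many passes are made:

```python
import random

def block_swap_shuffle(data, block=100, passes=None):
    """Approximate in-place shuffle by repeatedly swapping two
    randomly chosen, non-overlapping blocks along axis 0."""
    n = len(data)
    if passes is None:
        passes = n // block  # heuristic number of swap rounds
    for _ in range(passes):
        i = random.randrange(n - block + 1)
        j = random.randrange(n - block + 1)
        if abs(i - j) < block:   # skip overlapping blocks
            continue
        # RHS slices are read (copied) first, then written back swapped
        data[i:i + block], data[j:j + block] = \
            data[j:j + block], data[i:i + block]

data = list(range(1000))
block_swap_shuffle(data, block=100)
print(sorted(data) == list(range(1000)))  # True: still a permutation
```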

Best answer

You can use random.shuffle(dataset). On a laptop with a Core i5 processor, 8 GB of RAM, and a 256 GB SSD, this takes a little over 11 minutes. See the following:

>>> import os
>>> import random
>>> import time
>>> import h5py
>>> import numpy as np
>>>
>>> h5f = h5py.File('example.h5', 'w')
>>> h5f.create_dataset('example', (40000, 256, 256, 3), dtype='float32')
>>> # set all values of each instance equal to its index
... for i, instance in enumerate(h5f['example']):
...     h5f['example'][i, ...] = \
...         np.ones(instance.shape, dtype='float32') * i
...
>>> # get file size in bytes
... file_size = os.path.getsize('example.h5')
>>> print('Size of example.h5: {:.3f} GB'.format(file_size / 2.0**30))
Size of example.h5: 29.297 GB
>>> def shuffle_time():
...     t1 = time.time()
...     random.shuffle(h5f['example'])
...     t2 = time.time()
...     print('Time to shuffle: {:.3f} seconds'.format(t2 - t1))
...
>>> print('Value of first 5 instances:\n{}'
...       ''.format(str(h5f['example'][:5, 0, 0, 0])))
Value of first 5 instances:
[ 0.  1.  2.  3.  4.]
>>> shuffle_time()
Time to shuffle: 673.848 seconds
>>> print('Value of first 5 instances after '
...       'shuffling:\n{}'.format(str(h5f['example'][:5, 0, 0, 0])))
Value of first 5 instances after shuffling:
[ 15733.  28530.   4234.  14869.  10267.]
>>> h5f.close()
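The reason random.shuffle works directly on an h5py dataset is that shuffle only needs len() plus integer item get/set, which datasets provide; every swap then becomes a pair of reads and writes to disk, which explains the runtime. A minimal stand-in class (hypothetical name `MiniSequence`) illustrates the required interface:

```python
import random

class MiniSequence:
    """Bare-bones mutable sequence exposing exactly what
    random.shuffle needs: __len__, __getitem__, __setitem__."""
    def __init__(self, items):
        self.items = list(items)
    def __len__(self):
        return len(self.items)
    def __getitem__(self, i):
        return self.items[i]
    def __setitem__(self, i, value):
        self.items[i] = value

s = MiniSequence(range(10))
random.shuffle(s)  # in-place index swaps through the interface above
print(sorted(s.items) == list(range(10)))  # True: a permutation
```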

Shuffling several smaller datasets should perform no worse than this.
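An alternative to in-place swapping (a sketch of my own, not part of the answer) is to write the rows into a second dataset in permuted order, reading a chunk of rows per pass; the sorted read works around h5py's requirement that fancy-index selections be in increasing order. The names `permute_in_chunks`, `src`, and `dst` are illustrative, and plain NumPy arrays stand in for the two datasets here:

```python
import numpy as np

def permute_in_chunks(src, dst, chunk=1000, seed=0):
    """Copy src into dst with rows in random order, reading at most
    `chunk` rows per pass (both objects sliceable along axis 0)."""
    n = len(src)
    perm = np.random.default_rng(seed).permutation(n)
    for start in range(0, n, chunk):
        idx = perm[start:start + chunk]
        order = np.argsort(idx)
        rows = src[np.sort(idx)]          # single ascending read
        # undo the sort so rows land in permuted order
        dst[start:start + len(idx)] = rows[np.argsort(order)]

# NumPy arrays stand in for the HDF5 datasets
src = np.arange(10000)
dst = np.empty_like(src)
permute_in_chunks(src, dst, chunk=1000)
print((np.sort(dst) == src).all())  # True: dst is a permutation of src
```

This trades the answer's ~n in-place swaps for one sequential write at the cost of a second copy of the data on disk.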

On python - Shuffling an HDF5 dataset with h5py, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33900486/
