
python - Computing the mean of a large numpy array memory-mapped from an hdf5 file

Reposted. Author: 行者123. Updated: 2023-12-01 02:31:53

I have a problem computing the mean of a numpy array that is too large for RAM (~100G).


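I looked into using np.memmap, but unfortunately my array is stored as a dataset in an hdf5 file. From what I have tried, np.memmap does not accept an hdf5 dataset as input:
TypeError: coercing to Unicode: need string or buffer, Dataset found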

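So how can I efficiently call np.mean on a memory-mapped array from disk? Of course, I could loop over parts of the dataset, where each part fits into memory.
However, this feels too much like a workaround, and I am also not sure it would give the best performance.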


Here is some sample code:

import h5py
import numpy as np

# Small stand-in array; the real data is typically a lot larger, ~100G.
data = np.random.randint(0, 255, 100000*10*10*10, dtype=np.uint8)
data = data.reshape((100000, 10, 10, 10))

# Write the array to an HDF5 file; the context manager closes the file afterwards.
with h5py.File('data.h5', 'w') as hdf5_file:
    hdf5_file.create_dataset('x', data=data, dtype='uint8')

def get_mean_image(filepath):
    """
    Returns the mean_array of a dataset.
    """
    f = h5py.File(filepath, "r")
    xs_mean = np.mean(f['x'], axis=0)  # memory error with large enough array

    return xs_mean

xs_mean = get_mean_image('./data.h5')
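
For context on the TypeError mentioned above: np.memmap expects a filename or file-like object pointing at raw binary data, together with the dtype and shape, so it has no way to interpret an h5py Dataset object. A minimal sketch of what it does accept, using a throwaway raw file (the file name `data.raw` and the small shape are made up for illustration):

```python
import os
import tempfile

import numpy as np

shape = (1000, 10, 10)
# Deterministic stand-in data, small enough to compare against an in-memory mean.
data = (np.arange(np.prod(shape)) % 251).astype(np.uint8).reshape(shape)

with tempfile.TemporaryDirectory() as tmpdir:
    raw_path = os.path.join(tmpdir, "data.raw")
    data.tofile(raw_path)  # raw bytes only, no header or metadata

    # dtype and shape must be supplied by hand, since the raw file does not store them.
    mapped = np.memmap(raw_path, dtype=np.uint8, mode="r", shape=shape)
    mapped_mean = np.asarray(np.mean(mapped, axis=0))

    del mapped  # release the mapping before the temporary directory is removed
```

An HDF5 file carries its own headers and possibly chunked or compressed storage, which is why the dataset has to be read through h5py instead.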

Best answer

As hpaulj suggested in the comments, I simply split the mean calculation into multiple steps.
Here is my (simplified) code, in case it is useful to someone:

import os

import h5py
import numpy as np
import psutil

def get_mean_image(filepath):
    """
    Returns the mean_image of a xs dataset.
    :param str filepath: Filepath of the data upon which the mean_image should be calculated.
    :return: ndarray xs_mean: mean_image of the x dataset.
    """
    with h5py.File(filepath, "r") as f:
        # check available memory and divide the mean calculation in steps
        total_memory = 0.5 * psutil.virtual_memory().available  # In bytes. Take 1/2 of what is available, just to make sure.
        filesize = os.path.getsize(filepath)
        steps = int(np.ceil(filesize/total_memory))
        n_rows = f['x'].shape[0]
        stepsize = int(n_rows / float(steps))

        # xs_mean_arr stores the intermediate mean of each step, step_rows the number of rows it covers
        xs_mean_arr = np.zeros((steps, ) + f['x'].shape[1:], dtype=np.float64)
        step_rows = np.zeros(steps, dtype=np.float64)

        for i in range(steps):
            start = i * stepsize
            # for the last step, calculate mean till the end of the file
            stop = n_rows if i == steps-1 else (i+1) * stepsize
            xs_mean_arr[i] = np.mean(f['x'][start:stop], axis=0, dtype=np.float64)
            step_rows[i] = stop - start

    # weight each step by its row count, since the last step can be longer than the others
    xs_mean = np.average(xs_mean_arr, axis=0, weights=step_rows).astype(np.float32)

    return xs_mean
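
An alternative to storing one mean per step is to keep a running sum and divide once at the end, which weights every row equally whatever the chunk sizes are. A minimal sketch, assuming any array-like with `.shape` and NumPy-style slicing (an h5py dataset such as `f['x']` should fit that description); `chunked_mean` and `rows_per_chunk` are illustrative names, not part of the original answer:

```python
import numpy as np

def chunked_mean(dataset, rows_per_chunk):
    """Mean over axis 0, reading at most rows_per_chunk rows at a time."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    n_rows = dataset.shape[0]
    # Accumulate in float64 so small integer dtypes such as uint8 cannot overflow.
    total = np.zeros(dataset.shape[1:], dtype=np.float64)
    for start in range(0, n_rows, rows_per_chunk):
        # Slicing pulls only this block of rows into memory.
        chunk = dataset[start:start + rows_per_chunk]
        total += np.sum(chunk, axis=0, dtype=np.float64)
    return total / n_rows

# In-memory stand-in for an on-disk dataset, used here only to check the result.
rng = np.random.default_rng(0)
demo = rng.integers(0, 255, size=(1000, 10, 10), dtype=np.uint8)
result = chunked_mean(demo, rows_per_chunk=64)
```

Choosing `rows_per_chunk` from the in-memory size of one row (`np.prod(shape[1:]) * itemsize`) rather than from the file size is probably the safer estimate when the HDF5 dataset is compressed, since the file on disk can then be much smaller than the data it expands to.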

Regarding python - computing the mean of a large numpy array memory-mapped from an hdf5 file, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46727907/
