
python - Querying a NumPy array of NumPy arrays saved as npz is slow

Reposted. Author: 太空宇宙. Updated: 2023-11-04 10:18:07

I generate an npz file as follows:

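import numpy as np

# Generate the npz file
dataset_text_filepath = 'test_np_load.npz'
texts = []
for text_number in range(30000):
    # np.random.random_integers is deprecated; randint excludes the upper
    # bound, so 20001 and 101 keep the original inclusive ranges
    texts.append(np.random.randint(0, 20001,
                                   size=np.random.randint(0, 101)))
# The arrays have different lengths, so store them in an object array
texts = np.array(texts, dtype=object)
np.savez(dataset_text_filepath, texts=texts)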

This gives me a ~7 MiB npz file (essentially a single variable, texts, which is a NumPy array of NumPy arrays).

I load it with numpy.load():

# Load data (recent NumPy versions need allow_pickle=True for object arrays)
dataset = np.load(dataset_text_filepath, allow_pickle=True)

If I query it like this, it takes several minutes:

# Querying data: the slow way
for i in range(20):
    print('Run {0}'.format(i))
    random_indices = np.random.randint(0, len(dataset['texts']), size=10)
    dataset['texts'][random_indices]

Whereas if I query it as follows, it takes less than 5 seconds:

# Querying data: the fast way
data_texts = dataset['texts']
for i in range(20):
    print('Run {0}'.format(i))
    random_indices = np.random.randint(0, len(data_texts), size=10)
    data_texts[random_indices]

Why is the second approach so much faster than the first?

Best answer

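dataset['texts'] reads the file every time it is used. Calling load on an npz file returns only a file loader, not the actual data. It is a "lazy loader" that loads a given array only when that array is accessed. The slow loop evaluates dataset['texts'] twice per iteration, so it re-reads and decodes all 30000 arrays 40 times, while the fast version does so once. The load documentation could be clearer, but it says: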
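The lazy behaviour described above can be observed directly. The following is a minimal sketch, assuming a small throwaway file (the name lazy_demo.npz and the variable names are hypothetical, not from the question): two accesses to the same key return two distinct array objects, because each access reads the member from the archive again.

```python
import os
import tempfile

import numpy as np

# Hypothetical small file, written only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "lazy_demo.npz")
np.savez(demo_path, texts=np.arange(10))

with np.load(demo_path) as npz:
    first = npz["texts"]   # reads and decodes the member from the archive
    second = npz["texts"]  # reads and decodes it again
    is_same_object = first is second
    has_same_values = np.array_equal(first, second)

# Distinct objects with equal contents: nothing was cached between accesses.
print(is_same_object, has_same_values)
```

This is why binding the array to a name once, as in the "fast way" from the question, avoids the repeated file reads.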

- If the file is a ``.npz`` file, the returned value supports the context
  manager protocol in a similar fashion to the open function::

    with load('foo.npz') as data:
        a = data['a']

  The underlying file descriptor is closed when exiting the 'with' block.
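As a rough illustration of the quoted pattern, the sketch below (file name and variable names are assumptions made for the example) reads an array once inside the with block and keeps using it after the file has been closed. The array is an ordinary in-memory ndarray at that point, so later indexing does not touch the disk.

```python
import os
import tempfile

import numpy as np

# Hypothetical demo file; any .npz written by np.savez would do.
demo_path = os.path.join(tempfile.mkdtemp(), "with_demo.npz")
np.savez(demo_path, a=np.arange(5))

with np.load(demo_path) as data:
    a = data["a"]  # the array is read into memory here, exactly once

# The archive is closed now, but `a` remains usable.
total = int(a.sum())
subset = a[[0, 2, 4]]  # fancy indexing on the in-memory copy
print(total, subset)
```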

And from the savez documentation:

  When opening the saved ``.npz`` file with `load` a `NpzFile` object is
  returned. This is a dictionary-like object which can be queried for
  its list of arrays (with the ``.files`` attribute), and for the arrays
  themselves.
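One way to use the ``.files`` attribute mentioned above is to materialize every array into a plain dict up front, so that later lookups never go back to the archive. This is only a sketch under assumed names (files_demo.npz, texts, labels are made up for the example), and it trades memory for speed, since all arrays are held at once.

```python
import os
import tempfile

import numpy as np

# Hypothetical archive with two members, created for this example.
demo_path = os.path.join(tempfile.mkdtemp(), "files_demo.npz")
np.savez(demo_path, texts=np.arange(6), labels=np.array([1, 0, 1]))

with np.load(demo_path) as npz:
    names = list(npz.files)                       # names of the stored arrays
    arrays = {name: npz[name] for name in names}  # read each array once

# Repeated queries now hit the in-memory dict, not the file.
picked = arrays["texts"][[0, 3, 5]]
print(sorted(names), picked)
```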

More details are available in help(np.lib.npyio.NpzFile).

Source: the original question, "python - Querying a NumPy array of NumPy arrays saved as npz is slow", is on Stack Overflow: https://stackoverflow.com/questions/34119752/
