python - Computing the mean and covariance of a large matrix (300000 x 70000)

Reposted. Author: 太空宇宙. Updated: 2023-11-04 09:41:56

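I am using Numpy and trying to compute the mean and covariance of a large matrix (300000 x 70000). I have 32GB of memory available. What is the best practice for this task in terms of computational efficiency and ease of implementation?

My current implementation is as follows: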

import numpy as np

def compute_mean_variance(mat, chunk_size):
    row_count = mat.row_count
    col_count = mat.col_count
    # Maintain running per-column sums `x_sum` and `x2_sum`:
    #   mean(x) = x_sum / row_count
    #   var(x)  = x2_sum / row_count - mean(x)**2
    x_sum = np.zeros([1, col_count])
    x2_sum = np.zeros([1, col_count])

    for i in range(0, row_count, chunk_size):
        sub_mat = mat[i:i+chunk_size, :]
        # in-memory sub_mat of size chunk_size x num_cols
        sub_mat = sub_mat.read().val
        x_sum += np.sum(sub_mat, 0)
        # add only this chunk's sum of squares (do not add x2_sum to itself)
        x2_sum += np.sum(sub_mat**2, 0)
    x_mean = x_sum / row_count
    x_var = x2_sum / row_count - x_mean ** 2
    return x_mean, x_var

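Any suggestions for improvement?

I find that the implementation below should be easier to understand. It also uses numpy to compute the mean and standard deviation of column blocks, so it should be more efficient and numerically stable.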

import numpy as np

def compute_mean_std(mat, chunk_size):
    row_count = mat.row_count
    col_count = mat.col_count
    mean = np.zeros(col_count)
    std = np.zeros(col_count)

    # Process the matrix one block of columns at a time
    # (`range` here; the Python 2 original used `xrange`).
    for i in range(0, col_count, chunk_size):
        sub_mat = mat[:, i : i + chunk_size]
        # in-memory sub_mat of size num_samples x chunk_size
        sub_mat = sub_mat.read().val
        mean[i : i + chunk_size] = np.mean(sub_mat, axis=0)
        std[i : i + chunk_size] = np.std(sub_mat, axis=0)

    return mean, std

Best answer

I assume that, to compute the variance, you are using what Wikipedia calls the naïve algorithm. However, the same article warns:

Because x2_sum / row_count and x_mean ** 2 can be very similar numbers, cancellation can lead to the precision of the result to be much less than the inherent precision of the floating-point arithmetic used to perform the computation. Thus this algorithm should not be used in practice. This is particularly bad if the standard deviation is small relative to the mean.
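
To make the quoted warning concrete, here is a small sketch of the cancellation effect. The data is made up for illustration (a mean of about 1e9 with a standard deviation of about 1); it is not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: large mean, small spread (true variance is about 1).
x = 1e9 + rng.standard_normal(100_000)

# Naive formula: E[x^2] - E[x]^2. Both terms are about 1e18, where
# float64 can no longer resolve differences of order 1.
naive_var = np.mean(x**2) - np.mean(x) ** 2

# Two-pass formula: subtract the mean first, then square.
two_pass_var = np.mean((x - np.mean(x)) ** 2)

print("naive:   ", naive_var)
print("two-pass:", two_pass_var)
```

The two-pass value stays close to 1, while the naive value is dominated by rounding error, because the two terms being subtracted agree in almost all of their significant digits.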

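As an alternative, you can use the two-pass algorithm, i.e. first compute the mean and then use it to compute the variance. In principle this seems wasteful, since the data has to be iterated over twice. However, the "mean" used in the variance computation does not have to be the true mean; a reasonable estimate (perhaps computed from the first block only) is sufficient. This then reduces to the method of the assumed mean.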
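
A minimal sketch of the assumed-mean idea, applied chunk by chunk over rows. It assumes `data` is an ordinary in-memory NumPy array standing in for the question's `mat` object; the function name `shifted_mean_var` is hypothetical. It returns the population variance (divide by N), matching the formula in the question.

```python
import numpy as np

def shifted_mean_var(data, chunk_size):
    """Per-column mean and population variance via an assumed mean."""
    n_rows, n_cols = data.shape
    # Assumed mean K: a rough estimate taken from the first chunk only.
    k = data[:chunk_size].mean(axis=0)
    d_sum = np.zeros(n_cols)   # running sum of (x - K)
    d2_sum = np.zeros(n_cols)  # running sum of (x - K)**2
    for start in range(0, n_rows, chunk_size):
        d = data[start:start + chunk_size] - k
        d_sum += d.sum(axis=0)
        d2_sum += (d * d).sum(axis=0)
    mean = k + d_sum / n_rows
    # Same algebra as the naive formula, but on shifted data, so the two
    # terms are small and cancellation is far less harmful.
    var = d2_sum / n_rows - (d_sum / n_rows) ** 2
    return mean, var
```

Only one pass over the full data is needed, since K comes from the first chunk. The closer K is to the true column means, the smaller the shifted values and the less precision is lost.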

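In addition, one possibility is to delegate the computation of each block's mean/variance directly to numpy and then combine the results, using the parallel algorithm, to obtain the overall mean/variance.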
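
The combination step could look roughly like the sketch below. It uses the pairwise update commonly given for the parallel algorithm (merging counts, means, and sums of squared deviations). As an assumption, `data` is an in-memory NumPy array standing in for the question's `mat`, and `combined_mean_var` is a hypothetical name.

```python
import numpy as np

def combined_mean_var(data, chunk_size):
    """Per-column mean and population variance from per-chunk statistics."""
    n_rows, n_cols = data.shape
    n_total = 0
    mean = np.zeros(n_cols)
    m2 = np.zeros(n_cols)  # running sum of squared deviations from the mean
    for start in range(0, n_rows, chunk_size):
        chunk = data[start:start + chunk_size]
        n_b = chunk.shape[0]
        # Per-chunk statistics are delegated to numpy.
        mean_b = chunk.mean(axis=0)
        m2_b = chunk.var(axis=0) * n_b
        # Merge the chunk into the running totals.
        delta = mean_b - mean
        n_new = n_total + n_b
        mean = mean + delta * n_b / n_new
        m2 = m2 + m2_b + delta ** 2 * n_total * n_b / n_new
        n_total = n_new
    return mean, m2 / n_total
```

Because each chunk is summarized independently before merging, the per-chunk work could also be distributed across processes, with only the three small summaries (count, mean, M2) exchanged.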

Regarding python - computing the mean and covariance of a large matrix (300000 x 70000), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51371402/
