python - Computing the mean and covariance of a large matrix (300000 x 70000)

Reposted by: 太空宇宙. Updated: 2023-11-04 09:41:56

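I am using Numpy and trying to compute the mean and covariance of a large matrix (300000 x 70000). I have 32GB of memory available. What is the best practice for this task in terms of computational efficiency and ease of implementation?

My current implementation is as follows: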

import numpy as np

def compute_mean_variance(mat, chunk_size):
    row_count = mat.row_count
    col_count = mat.col_count
    # maintain the per-column running sums `x_sum` and `x2_sum`
    # mean(x) = x_sum / row_count
    # var(x) = x2_sum / row_count - mean(x)**2
    x_sum = np.zeros([1, col_count])
    x2_sum = np.zeros([1, col_count])

    for i in range(0, row_count, chunk_size):
        sub_mat = mat[i:i+chunk_size, :]
        # in-memory sub_mat of size chunk_size x col_count
        sub_mat = sub_mat.read().val
        x_sum += np.sum(sub_mat, 0)
        # accumulate the per-column sum of squares for this chunk
        x2_sum += np.sum(sub_mat**2, 0)
    x_mean = x_sum / row_count
    x_var = x2_sum / row_count - x_mean ** 2
    return x_mean, x_var

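Any suggestions for improvement?

I found that the implementation below should be easier to understand. It also uses numpy to compute the mean and standard deviation of each block of columns, so it should be more efficient and numerically stable.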

def compute_mean_std(mat, chunk_size):
    row_count = mat.row_count
    col_count = mat.col_count
    mean = np.zeros(col_count)
    std = np.zeros(col_count)

    for i in range(0, col_count, chunk_size):
        sub_mat = mat[:, i : i + chunk_size]
        # num_samples x chunk_size
        sub_mat = sub_mat.read().val
        mean[i : i + chunk_size] = np.mean(sub_mat, axis=0)
        std[i : i + chunk_size] = np.std(sub_mat, axis=0)

    return mean, std

Best answer

I assume that, to compute the variance, you are using what Wikipedia calls the Naïve algorithm. However, as noted there:

Because x2_sum / row_count and x_mean ** 2 can be very similar numbers, cancellation can lead to the precision of the result to be much less than the inherent precision of the floating-point arithmetic used to perform the computation. Thus this algorithm should not be used in practice. This is particularly bad if the standard deviation is small relative to the mean.
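
The cancellation described in the quote can be reproduced on a small synthetic data set. The snippet below is only an illustrative sketch: the data (mean 1e9, standard deviation 1) and the variable names are made up for the demonstration and are not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: large mean (1e9), small standard deviation (1.0).
x = 1e9 + rng.standard_normal(100_000)

# Naive formula: E[x^2] - E[x]^2. Both terms are close to 1e18,
# so their difference loses almost all significant digits.
naive_var = np.mean(x ** 2) - np.mean(x) ** 2

# Two-pass formula: subtract the mean first, then average the squares.
two_pass_var = np.mean((x - np.mean(x)) ** 2)

print("naive:   ", naive_var)
print("two-pass:", two_pass_var)
```

The two-pass value should land close to 1, while the naive value is dominated by rounding error because float64 cannot resolve differences of order 1 between numbers of order 1e18.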

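As an alternative, you could use the two-pass algorithm, i.e. first compute the mean and then use it to compute the variance. In principle this seems wasteful, since the data has to be iterated over twice. However, the "mean" used in the variance computation need not be the true mean; a reasonable estimate (perhaps computed from the first chunk only) is sufficient. This then reduces to the method of the assumed mean.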
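
A minimal sketch of the assumed-mean idea, under stated assumptions: `chunks` is a hypothetical iterable of in-memory 2-D arrays (rows x columns) standing in for the `mat[i:i+chunk_size, :].read().val` reads in the question, and the function name is invented for illustration. The shift `K` is taken from the first chunk, and the result is the population variance (divisor `n`), matching the question's code.

```python
import numpy as np

def chunked_mean_var_shifted(chunks):
    """Per-column mean and population variance in a single pass,
    using the first chunk's mean as the assumed mean K."""
    shift = None    # assumed mean K, one value per column
    n = 0           # number of rows seen so far
    d_sum = None    # running sum of (x - K)
    d2_sum = None   # running sum of (x - K)**2
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        if shift is None:
            shift = chunk.mean(axis=0)
            d_sum = np.zeros_like(shift)
            d2_sum = np.zeros_like(shift)
        d = chunk - shift
        n += d.shape[0]
        d_sum += d.sum(axis=0)
        d2_sum += (d ** 2).sum(axis=0)
    mean = shift + d_sum / n
    var = d2_sum / n - (d_sum / n) ** 2
    return mean, var

# Usage on synthetic data split into row chunks of 100.
rng = np.random.default_rng(0)
data = 1e9 + rng.standard_normal((1000, 5))
mean, var = chunked_mean_var_shifted(
    data[i:i + 100] for i in range(0, data.shape[0], 100)
)
```

Because the shifted values `x - K` are small whenever `K` is near the true mean, the two terms subtracted in the last step are no longer nearly equal, which is what avoids the cancellation.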

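Another possibility is to delegate the computation of each chunk's mean/variance directly to numpy, and then combine the per-chunk results with the parallel algorithm to obtain the overall mean/variance.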
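
The combination step could look roughly like the sketch below. It uses the standard pairwise update for counts, means and sums of squared deviations (`M2`) from the parallel algorithm; the function names and the `chunks` iterable are hypothetical, and reading chunks from the question's `mat` object is left out.

```python
import numpy as np

def combine(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Merge the (count, mean, M2) summaries of two disjoint sets of rows."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * (n_b / n)
    m2 = m2_a + m2_b + delta ** 2 * (n_a * n_b / n)
    return n, mean, m2

def chunked_mean_var_parallel(chunks):
    """Per-column mean and population variance from row chunks,
    letting numpy compute each chunk's mean and variance."""
    n, mean, m2 = 0, 0.0, 0.0
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        n_b = chunk.shape[0]
        if n_b == 0:
            continue  # nothing to merge for an empty chunk
        mean_b = chunk.mean(axis=0)
        m2_b = chunk.var(axis=0) * n_b  # sum of squared deviations
        n, mean, m2 = combine(n, mean, m2, n_b, mean_b, m2_b)
    return mean, m2 / n

# Usage on synthetic data with uneven chunk sizes.
rng = np.random.default_rng(1)
data = 50.0 + 3.0 * rng.standard_normal((1000, 4))
parts = [data[:130], data[130:700], data[700:]]
mean, var = chunked_mean_var_parallel(parts)
```

Since each chunk's summary is independent of the others, the per-chunk statistics could also be computed by separate workers and merged afterwards with `combine`.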

Regarding "python - Computing the mean and covariance of a large matrix (300000 x 70000)", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51371402/
