python - Efficient way of computing Kullback–Leibler divergence in Python

Reposted by: 太空狗 · Updated: 2023-10-30 00:35:27

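I have to compute the Kullback-Leibler divergence (KLD) between thousands of discrete probability vectors. Currently I am using the following code, but it is too slow for my purposes. Is there a faster way to compute the KL divergence?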

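import numpy as np
import scipy.stats as sc

# distributions is a 2-D array with one probability vector per row
n = distributions.shape[0]  # n is the number of data points
kld = np.zeros((n, n))      # the shape must be passed as a tuple
for i in range(n):
    for j in range(n):
        if i != j:
            kld[i, j] = sc.entropy(distributions[i, :], distributions[j, :])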

Best answer

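By default, SciPy's stats.entropy takes 1D arrays as inputs and returns a scalar, which is how it is used in the question. Internally the function also supports broadcasting, which we can exploit here to get a vectorized solution.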

From the docs -

scipy.stats.entropy(pk, qk=None, base=None)

If only probabilities pk are given, the entropy is calculated as S = -sum(pk * log(pk), axis=0).

If qk is not None, then compute the Kullback-Leibler divergence S = sum(pk * log(pk / qk), axis=0).
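As a quick check of the second formula, the sketch below compares a hand-written sum with stats.entropy on two small vectors. The vectors p and q are made-up illustrative values, not data from the question. It also shows that stats.entropy rescales its inputs to sum to 1, so unnormalized counts give the same result.

```python
import numpy as np
from scipy import stats

# Hypothetical example distributions (already normalized, strictly positive).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# KL divergence written out from the formula: sum(pk * log(pk / qk)).
kl_manual = np.sum(p * np.log(p / q))

# The same quantity via SciPy (natural logarithm by default).
kl_scipy = stats.entropy(p, q)

# Unnormalized inputs give the same value because entropy() rescales them.
kl_counts = stats.entropy(10 * p, 20 * q)

print(kl_manual, kl_scipy, kl_counts)
```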

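In our case, we compute this entropy for each row against every other row, with a sum reduction producing one scalar per iteration of the two nested loops. The output array therefore has shape (M,M), where M is the number of rows in the input array.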

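Now, the catch here is that stats.entropy() sums along axis=0, so we feed it two versions of distributions. Both have the bin axis (the entries within each row) brought to axis=0 so that the reduction runs along it, while the row index is placed on a different axis in each, giving trailing shapes (M,1) and (1,M). Broadcasting then yields an (M,M)-shaped output array.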

Thus, a vectorized and more efficient way to solve our case would be -

from scipy import stats
kld = stats.entropy(distributions.T[:,:,None], distributions.T[:,None,:])
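To make the broadcasting in this one-liner easier to follow, here is a small sketch that prints the intermediate shapes. The input is a hypothetical random array with M = 4 rows and N = 3 bins, and the sketch assumes a SciPy version in which stats.entropy broadcasts pk against qk.

```python
import numpy as np
from scipy import stats

# Hypothetical input: M = 4 distributions over N = 3 bins, strictly positive.
rng = np.random.default_rng(0)
distributions = rng.random((4, 3)) + 0.1

a = distributions.T[:, :, None]   # shape (N, M, 1): bins on axis 0, row i on axis 1
b = distributions.T[:, None, :]   # shape (N, 1, M): bins on axis 0, row j on axis 2
print(a.shape, b.shape)

# entropy() broadcasts a and b to (N, M, M) and sums over axis 0.
kld = stats.entropy(a, b)         # shape (M, M); kld[i, j] = KL(row i || row j)
print(kld.shape)

# Spot-check one entry against the plain 1-D call.
single = stats.entropy(distributions[1], distributions[2])
print(np.isclose(kld[1, 2], single))
```

The diagonal of kld comes out as zero, since each row is compared with itself there, which matches the i != j skip in the loop version.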

Runtime test and verification -

In [15]: def entropy_loopy(distrib):
    ...:     n = distrib.shape[0] #n is the number of data points
    ...:     kld = np.zeros((n, n))
    ...:     for i in range(0, n):
    ...:         for j in range(0, n):
    ...:             if(i != j):
    ...:                 kld[i, j] = stats.entropy(distrib[i, :], distrib[j, :])
    ...:     return kld
    ...: 

In [16]: distrib = np.random.randint(0,9,(100,100)) # Setup input

In [17]: out = stats.entropy(distrib.T[:,:,None], distrib.T[:,None,:])

In [18]: np.allclose(entropy_loopy(distrib),out) # Verify
Out[18]: True

In [19]: %timeit entropy_loopy(distrib)
1 loops, best of 3: 800 ms per loop

In [20]: %timeit stats.entropy(distrib.T[:,:,None], distrib.T[:,None,:])
10 loops, best of 3: 104 ms per loop
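One caveat about the test input above: np.random.randint(0,9,...) produces zeros, and the KL divergence is inf wherever qk is 0 and pk is not, so most off-diagonal entries in that comparison are likely to be inf. The sketch below is a stricter check under the assumption of strictly positive inputs. It also writes the pairwise computation with scipy.special.rel_entr and explicit normalization, a swapped-in variant rather than the answer's one-liner. The function names kld_loop and kld_pairwise and the test array are hypothetical.

```python
import numpy as np
from scipy import stats
from scipy.special import rel_entr


def kld_loop(distrib):
    """Reference: nested loops over 1-D stats.entropy calls."""
    n = distrib.shape[0]
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                out[i, j] = stats.entropy(distrib[i], distrib[j])
    return out


def kld_pairwise(distrib):
    """Pairwise KL via rel_entr, normalizing each row explicitly."""
    p = distrib / distrib.sum(axis=1, keepdims=True)  # (M, N), rows sum to 1
    # (M, 1, N) against (1, M, N) broadcasts to (M, M, N); sum over the bin axis.
    return rel_entr(p[:, None, :], p[None, :, :]).sum(axis=-1)


# Hypothetical test input: counts in 1..8, so there are no zero entries.
rng = np.random.default_rng(0)
distrib = rng.integers(1, 9, size=(50, 40)).astype(float)

expected = kld_loop(distrib)
result = kld_pairwise(distrib)
print(np.isfinite(expected).all(), np.allclose(expected, result))
```

Both vectorized forms build an intermediate array of shape (M, M, N), so with thousands of vectors it may be necessary to process the rows in chunks to keep memory use manageable.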