
Python scikit-learn mutual information implementation does not work with partitions of different sizes

Reposted. Author: 行者123. Updated: 2023-12-05 06:38:52

I want to compare two partitions/clusterings (P1 and P2) of a set S, where the partitions have different numbers of clusters. Example:

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]

As far as I know, mutual information could be a way to do this, and it is implemented in scikit-learn. Nothing in the definition says the partitions must have the same size (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html).

However, when I try this in code, I get an error because of the different sizes.

from sklearn import metrics
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]
metrics.mutual_info_score(P1, P2)


ValueErrorTraceback (most recent call last)
<ipython-input-183-d5cb8d32ce7d> in <module>()
2 P2 = [ [1,2,3,4], [5, 6]]
3
----> 4 metrics.mutual_info_score(P1,P2)

/home/user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/cluster/supervised.pyc in mutual_info_score(labels_true, labels_pred, contingency)
556 """
557 if contingency is None:
--> 558 labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
559 contingency = contingency_matrix(labels_true, labels_pred)
560 contingency = np.array(contingency, dtype='float')

/home/user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/cluster/supervised.pyc in check_clusterings(labels_true, labels_pred)
34 if labels_true.ndim != 1:
35 raise ValueError(
---> 36 "labels_true must be 1D: shape is %r" % (labels_true.shape,))
37 if labels_pred.ndim != 1:
38 raise ValueError(

ValueError: labels_true must be 1D: shape is (3, 2)

Is there a way to use scikit-learn and mutual information to see how close these partitions are? If not, is there a way that does not use mutual information?

Best Answer

The error is in the form the data is passed to the function. The correct form is a list with one label for each element of the global set being partitioned — in this case, one label per element of S. Each label identifies the cluster the element belongs to, so elements with the same label are in the same cluster. For this example:

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]
labs_1 = [1, 1, 2, 2, 3, 3]
labs_2 = [1, 1, 1, 1, 2, 2]
metrics.mutual_info_score(labs_1, labs_2)

The answer is:

0.636514168294813
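If the partitions are given in list-of-clusters form, the label lists can also be derived automatically instead of being written by hand. A minimal sketch (the helper name `partition_to_labels` is my own, not part of the original answer):

```python
from sklearn import metrics

def partition_to_labels(partition, S):
    # Map each element of S to the index of the cluster that contains it.
    lookup = {elem: idx for idx, cluster in enumerate(partition) for elem in cluster}
    return [lookup[elem] for elem in S]

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]

labs_1 = partition_to_labels(P1, S)  # [0, 0, 1, 1, 2, 2]
labs_2 = partition_to_labels(P2, S)  # [0, 0, 0, 0, 1, 1]
print(metrics.mutual_info_score(labs_1, labs_2))  # ≈ 0.6365
```

The actual label values do not matter, only which elements share a label, so 0-based indices give the same score as the hand-written 1-based labels above.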

If we want to compute the mutual information directly on the original partition format, the following code can be used:

from __future__ import division  # must be the first statement under Python 2
from sklearn import metrics
import numpy as np

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]
set_partition1 = [set(p) for p in P1]
set_partition2 = [set(p) for p in P2]

def prob_dist(clustering, cluster, N):
    '''Probability that a random element of S falls in the given cluster.'''
    return len(clustering[cluster]) / N

def prob_joint_dist(clustering1, clustering2, cluster1, cluster2, N):
    '''
    N (int): total number of elements
    clustering1 (list): first partition
    clustering2 (list): second partition
    cluster1 (int): index of a cluster in the first partition
    cluster2 (int): index of a cluster in the second partition
    '''
    c1 = clustering1[cluster1]
    c2 = clustering2[cluster2]
    n_ij = len(set(c1).intersection(c2))
    return n_ij / N

def mutual_info(clustering1, clustering2, N):
    '''
    clustering1 (list): first partition
    clustering2 (list): second partition
    Note: under Python 2, division must be imported from __future__.
    '''
    n_clas = len(clustering1)
    n_com = len(clustering2)
    mutual_info = 0
    for i in range(n_clas):
        for j in range(n_com):
            p_i = prob_dist(clustering1, i, N)
            p_j = prob_dist(clustering2, j, N)
            R_ij = prob_joint_dist(clustering1, clustering2, i, j, N)
            if R_ij > 0:
                mutual_info += R_ij * np.log(R_ij / (p_i * p_j))
    return mutual_info

mutual_info(set_partition1, set_partition2, len(S))

which gives the same answer:

0.63651416829481278

Note that we are using the natural logarithm rather than log2. The code can easily be adapted, though.
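Switching the log base only rescales the result, so a natural-log score can be converted to bits by dividing by ln 2 rather than rerunning anything. A small sketch of the conversion, using the label lists from the answer:

```python
import numpy as np
from sklearn import metrics

labs_1 = [1, 1, 2, 2, 3, 3]
labs_2 = [1, 1, 1, 1, 2, 2]

mi_nats = metrics.mutual_info_score(labs_1, labs_2)  # natural log, ≈ 0.6365
mi_bits = mi_nats / np.log(2)                        # same score in bits, ≈ 0.9183
print(mi_nats, mi_bits)
```

Equivalently, replacing `np.log` with `np.log2` in the `mutual_info` function above yields `mi_bits` directly.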

Regarding "Python scikit-learn mutual information implementation does not work with partitions of different sizes", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45444213/
