
python - Scikit-learn agglomerative clustering connectivity matrix

Reposted. Author: 太空狗. Last updated: 2023-10-29 21:36:16

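I am trying to perform constrained clustering with sklearn's AgglomerativeClustering. To constrain the algorithm, it requires a "connectivity matrix", which the documentation describes as follows: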

The connectivity constraints are imposed via an connectivity matrix: a scipy sparse matrix that has elements only at the intersection of a row and a column with indices of the dataset that should be connected. This matrix can be constructed from a-priori information: for instance, you may wish to cluster web pages by only merging pages with a link pointing from one to another.

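I have a list of pairs of observations that I want the algorithm to force into the same cluster. I can convert this list into a sparse scipy matrix (COO or CSR), but the resulting clusters do not respect the constraints.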

Some data:

import numpy as np
import scipy as sp
import pandas as pd
import scipy.sparse as ss
from sklearn.cluster import AgglomerativeClustering


# unique ids
ids = np.arange(10)

# Pairs that should belong to the same cluster
mustLink = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# Features for training the model
data = pd.DataFrame([
[.0873,-1.619,-1.343],
[0.697456, 0.410943, 0.804333],
[-1.295829, -0.709441, -0.376771],
[-0.404985, -0.107366, 0.875791],
[-0.404985, -0.107366, 0.875791],
[-0.515996, 0.731980, -1.569586],
[1.024580, 0.409148, 0.149408],
[-0.074604, 1.269414, 0.115744],
[-0.006706, 2.097276, 0.681819],
[-0.432196, 1.249149,-1.159271]])

Converting the pairs into a "connectivity matrix":

# Blank coo matrix to csr
sm = ss.coo_matrix((len(ids), len(ids)), np.int32).tocsr()
# Insert 1 for connected pairs and diagonals
for i in np.arange(len(mustLink)):  # add links to both sides of the matrix
    sm[mustLink.loc[i, 'A'], mustLink.loc[i, 'B']] = 1
    sm[mustLink.loc[i, 'B'], mustLink.loc[i, 'A']] = 1
for i in np.arange(sm.tocsr()[1].shape[1]):  # add diagonals
    sm[i, i] = 1
sm = sm.tocoo()  # convert back to coo format

Fitting the agglomerative clustering model:

m = AgglomerativeClustering(n_clusters=6, connectivity=sm)
out = m.fit_predict(X=data)

The warning I get:

UserWarning: the number of connected components of the connectivity matrix is 7 > 1. Completing it to avoid stopping the tree early.
  connectivity, n_components = _fix_connectivity(X, connectivity)

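Besides the ominous warning, the pairs that I expected to end up in the same cluster do not.

Is this because the sklearn algorithm is not designed to handle must-link constraints, and can only use the connectivity matrix as a distance-style restriction (the distinction is drawn here)?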

Best answer

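When you pass a connectivity matrix to sklearn.cluster.AgglomerativeClustering, all points in the matrix must be connected. Agglomerative clustering builds a hierarchy in which all points are iteratively merged together, so isolated clusters cannot exist. The connectivity matrix is useful for "turning off" connections between points that may be close in Euclidean space but far apart under another metric (see the swiss roll, or "jelly roll", example in the User Guide here).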

Another way to think about it: your points must form a single connected graph, and all you can do is turn off edges between nodes.
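To illustrate the intended use described above, here is a minimal sketch in the spirit of the User Guide's swiss roll example. The sample size, noise level, and `n_neighbors=10` are illustrative assumptions, not values taken from the question or the answer. A k-nearest-neighbors graph limits which points may be merged, so clusters follow the rolled surface instead of jumping between layers.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph

# Synthetic swiss roll data (sample size and noise are arbitrary choices)
X, _ = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)

# Sparse graph linking each point to its 10 nearest neighbors.
# Only pairs linked in this graph are candidates for merging.
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

model = AgglomerativeClustering(
    n_clusters=6, connectivity=connectivity, linkage="ward"
)
labels = model.fit_predict(X)
print(labels.shape)
```

Here the graph removes candidate merges; it does not force any particular pair of points into the same cluster.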

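This warning:

UserWarning: the number of connected components of the connectivity matrix is 7 > 1. Completing it to avoid stopping the tree early.
  connectivity, n_components = _fix_connectivity(X, connectivity)

tells you that your connectivity matrix has 7 disjoint connected components, which is more than the 1 that is allowed. sklearn therefore "completes" the matrix (essentially filling it in until no disjoint components remain), which is why your constraints are not respected at all.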
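The count of 7 can be checked directly. The sketch below rebuilds the question's connectivity matrix and counts its connected components with `scipy.sparse.csgraph.connected_components`; the variable names are my own, and the matrix is built in one step rather than with the question's loops.

```python
import numpy as np
import scipy.sparse as ss
from scipy.sparse.csgraph import connected_components

n_samples = 10
must_link = [(1, 2), (1, 3), (4, 6)]  # pairs from the question

# Symmetric entries for every pair, plus ones on the diagonal
a = [p[0] for p in must_link]
b = [p[1] for p in must_link]
diag = list(range(n_samples))
rows = a + b + diag
cols = b + a + diag
connectivity = ss.coo_matrix(
    (np.ones(len(rows), dtype=np.int32), (rows, cols)),
    shape=(n_samples, n_samples),
)

n_components, component_of = connected_components(connectivity, directed=False)
print(n_components)  # 7
```

The components are {1, 2, 3}, {4, 6}, and the five remaining points as singletons, which gives the 7 reported in the warning.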

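There is no easy fix here. You could try reassigning cluster centers after clustering so that your constraints are respected; otherwise you will need a different algorithm.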
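One possible workaround, offered as a sketch and not as something the answer itself proposes: collapse each must-link group into a single point before clustering, then copy each group's label back to its members. The helper `cluster_with_must_link` is hypothetical, and representing a group by its mean feature vector is an assumption that changes the merge criterion (a group of three points counts as one observation).

```python
import numpy as np
import scipy.sparse as ss
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import AgglomerativeClustering


def cluster_with_must_link(X, must_link, n_clusters):
    """Hypothetical helper: cluster must-link group centroids, then broadcast labels."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    pairs = np.asarray(must_link, dtype=int).reshape(-1, 2)
    graph = ss.coo_matrix(
        (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
        shape=(n_samples, n_samples),
    )
    # Each connected component is one must-link group (singletons included)
    n_groups, group_of = connected_components(graph, directed=False)
    # Assumption: represent each group by the mean of its members
    centroids = np.vstack(
        [X[group_of == g].mean(axis=0) for g in range(n_groups)]
    )
    group_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(centroids)
    # Every member inherits the label of its group
    return group_labels[group_of]


# Features and pairs from the question
data = np.array([
    [0.0873, -1.619, -1.343],
    [0.697456, 0.410943, 0.804333],
    [-1.295829, -0.709441, -0.376771],
    [-0.404985, -0.107366, 0.875791],
    [-0.404985, -0.107366, 0.875791],
    [-0.515996, 0.731980, -1.569586],
    [1.024580, 0.409148, 0.149408],
    [-0.074604, 1.269414, 0.115744],
    [-0.006706, 2.097276, 0.681819],
    [-0.432196, 1.249149, -1.159271],
])
must_link = [(1, 2), (1, 3), (4, 6)]

labels = cluster_with_must_link(data, must_link, n_clusters=6)
print(labels)
```

Because labels are assigned per group, linked pairs share a cluster by construction. `n_clusters` cannot exceed the number of groups (7 for this data).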

Source: the Stack Overflow question "python - Scikit-learn agglomerative clustering connectivity matrix": https://stackoverflow.com/questions/42821622/
