Python code for finding feature importance after k-means clustering

Reposted by: 行者123 | Updated: 2023-12-05 08:12:38

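I have looked into ways of finding feature importance (my dataset has only 9 features). Below are two approaches, but I am having trouble writing the Python code for them.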

I want to rank each feature by how much it influences the formation of the clusters.

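Compute the variance of the centroids along each dimension. The dimensions with the highest variance are the most important for telling the clusters apart.
If you only have a small number of variables, you can run a kind of leave-one-out test (remove one variable and redo the clustering). Also remember that k-means depends on its initialization, so you want to keep that fixed when you redo the clustering.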
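
The leave-one-out idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `leave_one_feature_out`, the toy data, and the use of the adjusted Rand index to compare clusterings are my own choices and do not come from the question. Fixing `random_state` makes each fit reproducible, but k-means++ still picks starting centroids from the data, so the starting points are not literally identical once a column is removed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def leave_one_feature_out(X, n_clusters=2, seed=0):
    """Hypothetical helper: map each feature index to the adjusted Rand index
    between the full clustering and the clustering without that feature."""
    # Reference clustering on all features, with a fixed seed
    base = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    scores = {}
    for j in range(X.shape[1]):
        X_drop = np.delete(X, j, axis=1)  # remove column j
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(X_drop)
        # A value near 1 means the clusters barely changed without feature j
        scores[j] = adjusted_rand_score(base.labels_, labels)
    return scores


# Assumed toy data: column 0 separates two groups, column 1 is pure noise
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    np.concatenate([rng.normal(0, 1, 100), rng.normal(20, 1, 100)]),
    rng.normal(0, 1, 200),
])

loo_scores = leave_one_feature_out(X_demo)
# Lowest score first: dropping that feature changes the clustering the most
loo_ranking = sorted(loo_scores, key=loo_scores.get)
print(loo_ranking, loo_scores)
```

Features whose removal leaves the labels almost unchanged (score close to 1) contribute little to the clustering, while a low score marks a feature the clusters depend on.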

Is there any Python code that can do this?

Best answer

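Suppose X has 200 samples, built synthetically so that they form two clusters. The code below uses 4 variables rather than the 9 in the question, and fills them in two at a time so that each pair can be plotted.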

import numpy as np
import matplotlib.pyplot as plt

# 200 samples, 4 features; rows 0-99 are one group, rows 100-199 the other
X = np.zeros((200,4))

# Feature 1: the two groups are well separated (means 40 and 70)
Feature1_1 = np.random.normal(loc=40, scale=1.0, size=100)
Feature1_2 = np.random.normal(loc=70, scale=3.0, size=100)

# Feature 2: also well separated (means 20 and 50)
Feature2_1 = np.random.normal(loc=20, scale=4.0, size=100)
Feature2_2 = np.random.normal(loc=50, scale=1.0, size=100)

X[:100,0]=Feature1_1
X[100:,0]=Feature1_2
X[:100,1]=Feature2_1
X[100:,1]=Feature2_2

# Column 0 (Feature 1) on the x-axis, column 1 (Feature 2) on the y-axis
plt.figure(figsize = (5,5))
plt.scatter(X[:,0],X[:,1])
plt.grid()
plt.xlabel('Feature 1',fontsize=18)
plt.ylabel('Feature 2',fontsize=18)

Now let's fill in a new feature with a much higher variance.

# Feature 3: nearly identical group means (40 vs 43) but a very large spread
Feature3_1 = np.random.normal(loc=40, scale=300.0, size=100)
Feature3_2 = np.random.normal(loc=43, scale=280.0, size=100)

# Feature 2 is redrawn here with the same parameters as before
Feature2_1 = np.random.normal(loc=20, scale=4.0, size=100)
Feature2_2 = np.random.normal(loc=50, scale=1.0, size=100)


X[:100,2]=Feature3_1
X[100:,2]=Feature3_2

X[:100,1]=Feature2_1
X[100:,1]=Feature2_2

# Column 2 (Feature 3) on the x-axis, column 1 (Feature 2) on the y-axis
plt.figure(figsize = (5,5))
plt.scatter(X[:,2],X[:,1])
plt.grid()
plt.xlabel('Feature 3',fontsize=18)
plt.ylabel('Feature 2',fontsize=18)

The last feature also has a higher variance.

# Feature 3 is redrawn here with the same parameters as before
Feature3_1 = np.random.normal(loc=40, scale=300.0, size=100)
Feature3_2 = np.random.normal(loc=43, scale=280.0, size=100)

# Feature 4: nearly identical group means (20 vs 22) and a large spread
Feature4_1 = np.random.normal(loc=20, scale=40.0, size=100)
Feature4_2 = np.random.normal(loc=22, scale=40.0, size=100)


X[:100,2]=Feature3_1
X[100:,2]=Feature3_2

X[:100,3]=Feature4_1
X[100:,3]=Feature4_2

# Column 2 (Feature 3) on the x-axis, column 3 (Feature 4) on the y-axis
plt.figure(figsize = (5,5))
plt.scatter(X[:,2],X[:,3])
plt.grid()
plt.xlabel('Feature 3',fontsize=18)
plt.ylabel('Feature 4',fontsize=18)

Now let's cluster them with k-means.

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

Now let's visualize the clusters.

# Feature 1 vs Feature 2, colored by the k-means label
f1=0
f2=1

plt.figure(figsize = (5,5))
plt.scatter(X[kmeans.labels_==0][:,f1],X[kmeans.labels_==0][:,f2])
plt.scatter(X[kmeans.labels_==1][:,f1],X[kmeans.labels_==1][:,f2])
plt.grid()
plt.xlabel('Feature 1',fontsize=18)
plt.ylabel('Feature 2',fontsize=18)

# Feature 3 vs Feature 2
f1=2
f2=1

plt.figure(figsize = (5,5))
plt.scatter(X[kmeans.labels_==0][:,f1],X[kmeans.labels_==0][:,f2])
plt.scatter(X[kmeans.labels_==1][:,f1],X[kmeans.labels_==1][:,f2])
plt.grid()
plt.xlabel('Feature 3',fontsize=18)
plt.ylabel('Feature 2',fontsize=18)

# Feature 3 vs Feature 4
f1=2
f2=3

plt.figure(figsize = (5,5))
plt.scatter(X[kmeans.labels_==0][:,f1],X[kmeans.labels_==0][:,f2])
plt.scatter(X[kmeans.labels_==1][:,f1],X[kmeans.labels_==1][:,f2])
plt.grid()
plt.xlabel('Feature 3',fontsize=18)
plt.ylabel('Feature 4',fontsize=18)

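We can now see very clearly that features 3 and 4 are the only ones that matter to the clustering. Note that standardizing the features would lead to a completely different result, because k-means relies on Euclidean distance and unscaled high-variance features dominate it.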
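
To make the effect of standardization concrete, here is a minimal sketch that clusters similar synthetic data once on the raw values and once after `StandardScaler`. The names `X_syn` and `true_groups`, the seeded generator, and the use of the adjusted Rand index to compare against the groups the data was drawn from are assumptions added for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
true_groups = np.repeat([0, 1], 100)  # rows 0-99 vs rows 100-199

# Same distributions as in the answer, drawn from a seeded generator
X_syn = np.column_stack([
    np.concatenate([rng.normal(40, 1.0, 100), rng.normal(70, 3.0, 100)]),      # Feature 1
    np.concatenate([rng.normal(20, 4.0, 100), rng.normal(50, 1.0, 100)]),      # Feature 2
    np.concatenate([rng.normal(40, 300.0, 100), rng.normal(43, 280.0, 100)]),  # Feature 3
    np.concatenate([rng.normal(20, 40.0, 100), rng.normal(22, 40.0, 100)]),    # Feature 4
])

# k-means on the raw values: distances are dominated by the large-scale features
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_syn)

# k-means after scaling every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_syn)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Agreement with the generating groups (1 = identical partition, about 0 = unrelated)
ari_raw = adjusted_rand_score(true_groups, raw_labels)
ari_scaled = adjusted_rand_score(true_groups, scaled_labels)
print('raw ARI:', round(ari_raw, 3), 'scaled ARI:', round(ari_scaled, 3))
```

Whether to scale is a modeling decision: without scaling the split follows whichever feature has the largest numeric spread, and with scaling it follows the features whose group means actually differ.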

Finally, we can automate this as follows:

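# For each feature, compare the two clusters found by k-means
for feature in range(X.shape[1]):
    # Per-cluster mean of this feature
    mean1 = X[kmeans.labels_==0][:,feature].mean()
    mean2 = X[kmeans.labels_==1][:,feature].mean()

    # Per-cluster (population) variance of this feature
    var1 = X[kmeans.labels_==0][:,feature].var()
    var2 = X[kmeans.labels_==1][:,feature].var()

    # 'Total Variance' is the sum of the two within-cluster variances
    print('feature:',feature,'Mean difference:',round(abs(mean1-mean2),3),'Total Variance:',round((var1+var2),3))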

This gives (feature indices are zero-based, so index 2 is Feature 3):

feature: 0 Mean difference: 1.69 Total Variance: 459.464 
feature: 1 Mean difference: 0.879 Total Variance: 449.829 
feature: 2 Mean difference: 66.213 Total Variance: 154932.184 
feature: 3 Mean difference: 2.076 Total Variance: 2731.953
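
The centroid-variance idea from the question can be sketched in the same spirit. For two clusters, the variance of the two centroid coordinates equals half the mean difference, squared, so it ranks features in the same order as the "Mean difference" column; unlike the loop, it also extends to more than two clusters. The helper name `rank_features_by_centroid_variance`, the toy data, and the feature names are hypothetical, and the result depends on feature scale just as the clustering itself does.

```python
import numpy as np
from sklearn.cluster import KMeans


def rank_features_by_centroid_variance(model, feature_names=None):
    """Hypothetical helper: rank features by how much the fitted
    centroids spread out along each dimension."""
    centers = model.cluster_centers_   # shape (n_clusters, n_features)
    spread = centers.var(axis=0)       # variance of the centroids per feature
    order = np.argsort(spread)[::-1]   # largest spread first
    if feature_names is None:
        feature_names = list(range(centers.shape[1]))
    return [(feature_names[i], float(spread[i])) for i in order]


# Assumed toy data: one feature separates the groups, the other does not
rng = np.random.default_rng(1)
X_toy = np.column_stack([
    np.concatenate([rng.normal(0, 1, 100), rng.normal(10, 1, 100)]),
    rng.normal(5, 1, 200),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
ranking = rank_features_by_centroid_variance(model, ['separating', 'noise'])
for name, value in ranking:
    print(name, round(value, 3))
```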

Regarding Python code for finding feature importance after k-means clustering, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61000826/
