python - sklearn: Mean Distance from Centroid of each cluster

Reposted. Author: 太空狗. Updated: 2023-10-29 21:38:57

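How do I find the mean distance from the centroid to all the data points in each cluster? I am able to find the Euclidean distance of each point in my dataset from the centroid of its cluster. Now I want the mean of those distances for every cluster. What is a good way to calculate the mean distance from each centroid? So far I have done this: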

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy.spatial.distance import euclidean
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    def k_means(self):
        data = pd.read_csv('hdl_gps_APPLE_20111220_130416.csv', delimiter=',')
        combined_data = data.iloc[0:, 0:4].dropna()
        array_convt = combined_data.values

        # Reduce the data to two principal components before clustering
        t_data = PCA(n_components=2).fit_transform(array_convt)
        k_means = KMeans()
        k_means.fit(t_data)
        # ------------ k means fit_predict method for testing purpose ------------
        clusters = k_means.fit_predict(t_data)
        cluster_0 = np.where(clusters == 0)
        print(cluster_0)

        X_cluster_0 = t_data[cluster_0]

        # Distance from the first point of cluster 0 to the centroid of cluster 0
        distance = euclidean(X_cluster_0[0], k_means.cluster_centers_[0])
        print(distance)

        classified_data = k_means.labels_
        x_min = t_data[:, 0].min() - 5
        x_max = t_data[:, 0].max() - 1

        df_processed = data.copy()
        # Index by the rows that survived dropna() so the lengths always match
        df_processed['Cluster Class'] = pd.Series(classified_data, index=combined_data.index)

        y_min, y_max = t_data[:, 1].min(), t_data[:, 1].max() + 5
        xx, yy = np.meshgrid(np.arange(x_min, x_max, 1), np.arange(y_min, y_max, 1))

        # Predict the cluster of every grid point to draw the decision regions
        Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        plt.figure(1)
        plt.clf()
        plt.imshow(Z, interpolation='nearest',
                   extent=(xx.min(), xx.max(), yy.min(), yy.max()),
                   cmap=plt.cm.Paired,
                   aspect='auto', origin='lower')

        plt.plot(t_data[:, 0], t_data[:, 1], 'k.', markersize=20)
        centroids = k_means.cluster_centers_
        inert = k_means.inertia_
        plt.scatter(centroids[:, 0], centroids[:, 1],
                    marker='x', s=169, linewidths=3,
                    color='w', zorder=8)
        plt.xlim(x_min, x_max)
        plt.ylim(y_min, y_max)
        plt.xticks(())
        plt.yticks(())
        plt.show()

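In short, I want to calculate the mean distance of all the data points in a particular cluster from that cluster's centroid, because I need to clean my data based on this mean distance.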

Best answer

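Here is one way. If you want a distance metric other than Euclidean, you can substitute another distance measure inside the k_mean_distance() function.

Calculate the distance between each cluster center and the data points assigned to it, and return the mean value.

Function for distance calculation: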

    def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
        # Calculate Euclidean distance for each 2-D data point assigned to centroid
        distances = [np.sqrt((x - cx)**2 + (y - cy)**2) for (x, y) in data[cluster_labels == i_centroid]]
        # Return the mean value
        return np.mean(distances)
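
As a sketch of the metric substitution mentioned above, the variant below replaces the hand-written square root with `np.linalg.norm`, which also removes the restriction to two columns. The function name `mean_distance_to_centroid` and the `norm_order` parameter are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np

def mean_distance_to_centroid(data, centroid, i_centroid, cluster_labels, norm_order=2):
    """Mean distance from `centroid` to the rows of `data` labelled `i_centroid`.

    `norm_order` is forwarded to np.linalg.norm:
    2 is Euclidean, 1 is Manhattan, np.inf is Chebyshev.
    """
    members = data[cluster_labels == i_centroid]
    # Row-wise vector norm of the offsets from the centroid
    return np.linalg.norm(members - centroid, ord=norm_order, axis=1).mean()

# Toy values, not taken from the question's dataset
points = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
centre = np.array([0.0, 0.0])

print(mean_distance_to_centroid(points, centre, 0, labels))                # 2.5
print(mean_distance_to_centroid(points, centre, 0, labels, norm_order=1))  # 3.5
```

Passing the centroid as a single array rather than as separate `cx`, `cy` values is a design choice here so that the same function works for data with any number of columns.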

For each centroid, use the function to get the mean distance:

    total_distance = []
    for i, (cx, cy) in enumerate(centroids):
        # Function from above
        mean_distance = k_mean_distance(data, cx, cy, i, cluster_labels)
        total_distance.append(mean_distance)

So, in the context of your question:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
        distances = [np.sqrt((x - cx)**2 + (y - cy)**2) for (x, y) in data[cluster_labels == i_centroid]]
        return np.mean(distances)

    # array_convt is the array built in the question's code
    t_data = PCA(n_components=2).fit_transform(array_convt)
    k_means = KMeans()
    clusters = k_means.fit_predict(t_data)
    centroids = k_means.cluster_centers_

    # One mean distance per centroid, in centroid order
    c_mean_distances = []
    for i, (cx, cy) in enumerate(centroids):
        mean_distance = k_mean_distance(t_data, cx, cy, i, clusters)
        c_mean_distances.append(mean_distance)
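
The same per-cluster means can probably be obtained without a Python loop. The sketch below is an alternative to the answer's approach, not part of it: it relies on `KMeans.transform`, which returns the distance from every sample to every cluster center, and on `np.bincount` to average per label. The helper name `mean_distance_per_cluster` and the synthetic data standing in for `t_data` are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def mean_distance_per_cluster(model, X):
    """Mean Euclidean distance from each centroid to the points assigned to it."""
    labels = model.predict(X)
    # transform() gives an (n_samples, n_clusters) matrix of point-to-center distances
    dist_to_all = model.transform(X)
    # Keep only the distance to the center each point is assigned to
    dist_to_own = dist_to_all[np.arange(len(X)), labels]
    n_clusters = model.cluster_centers_.shape[0]
    sums = np.bincount(labels, weights=dist_to_own, minlength=n_clusters)
    counts = np.bincount(labels, minlength=n_clusters)
    # Guard against division by zero for a cluster with no points
    return sums / np.maximum(counts, 1)

# Synthetic stand-in for t_data: three loose blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
mean_distances = mean_distance_per_cluster(model, X)
print(mean_distances)
```

The returned array is indexed by cluster label, so `mean_distances[i]` corresponds to `model.cluster_centers_[i]`.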

If you plot the result with plt.plot(c_mean_distances), you should see something like this:

[Figure: kmeans clusters vs mean value]
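
The question's stated goal is to clean the data based on this mean distance, but neither the question nor the answer says what the cutoff should be. The sketch below assumes one possible rule: keep a point only if its distance to its own centroid is at most `factor` times that cluster's mean distance. The function name `keep_mask`, the `factor` parameter, and its default of 2.0 are all hypothetical.

```python
import numpy as np

def keep_mask(data, centroids, cluster_labels, factor=2.0):
    """Boolean mask that is True for points within factor * mean distance of their centroid.

    The cutoff rule is an assumption for illustration, not part of the original answer.
    """
    # Distance from every point to the centroid of its own cluster
    dist = np.linalg.norm(data - centroids[cluster_labels], axis=1)
    n_clusters = centroids.shape[0]
    sums = np.bincount(cluster_labels, weights=dist, minlength=n_clusters)
    counts = np.bincount(cluster_labels, minlength=n_clusters)
    mean_dist = sums / np.maximum(counts, 1)
    return dist <= factor * mean_dist[cluster_labels]

# Toy values: one cluster with an obvious outlier at (10, 0)
toy_data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 0.0]])
toy_centroids = np.array([[0.0, 0.0]])
toy_labels = np.array([0, 0, 0, 0])

mask = keep_mask(toy_data, toy_centroids, toy_labels)
print(mask)            # [ True  True  True False]
cleaned = toy_data[mask]
```

With the question's variables, the call would presumably be `keep_mask(t_data, centroids, clusters)`, and `t_data[mask]` would hold the retained points.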

Regarding python - sklearn: Mean Distance from Centroid of each cluster, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40828929/
