
pandas - Continuous rather than discrete cluster groups - python

Reposted. Author: 行者123. Updated: 2023-12-05 09:35:17

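I am trying to cluster a set of points probabilistically. In the code below I have a set of xy points, recorded in X and Y. I want to cluster them into groups relative to a reference point, which is given in X2 and Y2.

With the help of an answer, my current approach runs k-means on the distance from each point to the reference point. It does give a way to cluster relative to the reference point, but the hard cut-offs and the fixed number of clusters k make it a poor fit when handling many datasets. For example, this example might need 3 clusters, but another example could need a different number, so I would have to inspect each one and change k by hand.

Given the non-probabilistic nature of k-means, a separate option could be a GMM. Can the reference point be taken into account when fitting the model? The output of the basic model is attached below, and it does not cluster the way I was hoping.

If I look at the probability of each point belonging to each group, the points are not grouped as I hoped. I also run into the same problem of having to change the number of components by hand. Because the points are randomly distributed, using AIC or BIC to choose a suitable number of clusters does not work: there is no optimal number.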

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.DataFrame({
'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],
'X2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
'Y2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
})


k-means:

# Euclidean distance from each point (X, Y) to its reference point (X2, Y2).
df['distance'] = np.sqrt((df['X'] - df['X2'])**2 + (df['Y'] - df['Y2'])**2)

model = KMeans(n_clusters = 2)

model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T)
df['group'] = model.labels_

plt.scatter(df['X'], df['Y'], c = model.labels_, cmap = 'bwr', marker = 'o', s = 5)
plt.scatter(df['X2'], df['Y2'], c ='k', marker = 'o', s = 5)


GMM:

from sklearn import mixture

Y_sklearn = df[['X','Y']].values

gmm = mixture.GaussianMixture(n_components=3, covariance_type='diag', random_state=42)
gmm.fit(Y_sklearn)
labels = gmm.predict(Y_sklearn)
df['group'] = labels
plt.scatter(Y_sklearn[:, 0], Y_sklearn[:, 1], c=labels, s=5, cmap='viridis')
# 'x' is an unfilled marker, so matplotlib ignores edgecolor; only c is set here.
plt.scatter(df['X2'], df['Y2'], c='red', marker = 'x', s = 5, zorder = 10)

proba = pd.DataFrame(gmm.predict_proba(Y_sklearn).round(2)).reset_index(drop = True)
df_pred = pd.concat([df, proba], axis = 1)
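
For reference, the AIC/BIC comparison mentioned above can be sketched roughly as follows. This is only an illustration, not a recommended selection procedure: the candidate range of 1 to 6 components is an arbitrary assumption, and the names `points`, `candidates` and `scores` are made up for this example. The data is the same X/Y set as in the question.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# The X/Y points from the question.
X = [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0]
Y = [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0]
points = np.column_stack([X, Y])

# Assumption: try 1..6 components; the upper bound is arbitrary.
candidates = range(1, 7)
scores = []
for k in candidates:
    gmm = GaussianMixture(n_components=k, covariance_type='diag', random_state=42)
    gmm.fit(points)
    # Lower AIC/BIC is better; both penalize additional components.
    scores.append((k, gmm.aic(points), gmm.bic(points)))

for k, aic, bic in scores:
    print(f"n_components={k}: AIC={aic:.1f}, BIC={bic:.1f}")
```

If the scores show no clear minimum across the candidates, that is consistent with the point made above that there is no obvious optimal number of components for this data.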


Best answer

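In my opinion, if you want to define clusters as "regions where points are close to each other", you should use DBSCAN. This clustering algorithm finds clusters by looking for regions where points are close together (dense regions), separated from other clusters by regions where the point density is lower. The algorithm can also classify points as noise (outliers). Outliers are labeled -1: they are points that do not belong to any cluster.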
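
As a small illustration of the -1 noise label, here is a minimal sketch on made-up toy data (not the data from the question): five tightly packed points plus one isolated point, clustered with scikit-learn's default DBSCAN settings. The array `toy_points` is a hypothetical example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: five points packed closely together, one far away.
toy_points = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.1],
    [0.1, 0.1],
    [0.05, 0.05],
    [10.0, 10.0],  # isolated point, expected to be labeled as noise
])

# Default hyperparameters: eps=0.5, min_samples=5.
labels = DBSCAN().fit_predict(toy_points)
print(labels)

# Points labeled -1 belong to no cluster.
noise_mask = labels == -1
print(f"{noise_mask.sum()} point(s) labeled as noise")
```

The five close points form a single dense region, while the isolated point has no neighbours within eps and is therefore reported as noise.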

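Below is some code that runs DBSCAN and inserts the cluster labels as a new categorical column in the Y_sklearn DataFrame. It also prints how many clusters and how many outliers were found.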

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN


Y_sklearn = df.loc[:, ["X", "Y"]].copy()
n_points = Y_sklearn.shape[0]

dbs = DBSCAN()
labels_clusters = dbs.fit_predict(Y_sklearn)

#Number of found clusters (outliers are not considered a cluster).
n_clusters = labels_clusters.max() + 1
print(f"DBSCAN found {n_clusters} clusters in dataset with {n_points} points.")

#Number of found outliers (possibly no outliers found).
n_outliers = np.count_nonzero((labels_clusters == -1))
if n_outliers:
    print(f"{n_outliers} outliers were found.\n")
else:
    print("No outliers were found.\n")

#Add cluster labels as a new column to original DataFrame.
Y_sklearn["cluster"] = labels_clusters
#Setting `cluster` column to Categorical dtype makes seaborn function properly treat
#cluster labels as categorical, and not numerical.
Y_sklearn["cluster"] = Y_sklearn["cluster"].astype("category")

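If you want to plot the results, I suggest using Seaborn. Below is some code that plots the points of the Y_sklearn DataFrame and colors them by the cluster they belong to. I also define a new palette, which is just the default Seaborn palette except that outliers (label -1) are drawn in black.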

import matplotlib.pyplot as plt
import seaborn as sns


name_palette = "tab10"
palette = sns.color_palette(name_palette)
if n_outliers:
    # Label -1 sorts first, so put the outlier color at the front of the palette.
    color_outliers = "black"
    palette.insert(0, color_outliers)
sns.set_palette(palette)


fig, ax = plt.subplots()
sns.scatterplot(data=Y_sklearn,
                x="X",
                y="Y",
                hue="cluster",
                ax=ax,
                )

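With the default hyperparameters, DBSCAN finds no clusters in the data you provided: every point is treated as an outlier, because no region has points that are clearly more densely packed. Is that your whole dataset, or just a sample? If it is a sample, the full dataset will have many more points and DBSCAN should find some high-density regions. Alternatively, you can try tuning the hyperparameters, in particular min_samples and eps. If you want to "force" the algorithm to find more clusters, you can decrease min_samples (default 5) or increase eps (default 0.5). The best hyperparameter values depend on the specific dataset, of course, but the defaults are considered quite good for DBSCAN. So if the algorithm treats every point in the dataset as an outlier, it means there are no "natural" clusters!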
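
The effect of changing eps and min_samples can be sketched as follows on the X/Y data from the question. The helper `summarize_dbscan` and the non-default (eps, min_samples) pairs are assumptions chosen purely for illustration; they are not tuned or recommended values. Note that eps is a distance in the same units as the data, so a sensible value depends on the scale of the coordinates.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# The X/Y points from the question.
X = [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0]
Y = [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0]
points = np.column_stack([X, Y])


def summarize_dbscan(points, eps, min_samples):
    """Hypothetical helper: run DBSCAN, return (n_clusters, n_outliers)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    n_clusters = len(set(labels) - {-1})  # -1 is noise, not a cluster
    n_outliers = int(np.count_nonzero(labels == -1))
    return n_clusters, n_outliers


# First pair is the scikit-learn default; the others are illustrative guesses.
for eps, min_samples in [(0.5, 5), (3.0, 3), (8.0, 3)]:
    n_clusters, n_outliers = summarize_dbscan(points, eps, min_samples)
    print(f"eps={eps}, min_samples={min_samples}: "
          f"{n_clusters} cluster(s), {n_outliers} outlier(s)")
```

Comparing the printed counts across settings gives a quick feel for how sensitive the result is to each hyperparameter before plotting anything.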

Regarding "pandas - Continuous rather than discrete cluster groups - python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/65947535/
