
python - Memory error during hierarchical clustering in Python 3.6

Reposted. Author: 太空宇宙. Updated: 2023-11-04 04:38:38

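I have a fairly large dataset (a 1841000 × 32 matrix) on which I want to run a hierarchical clustering algorithm. Both the AgglomerativeClustering class and the FeatureAgglomeration class in sklearn.cluster fail with the following error.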

    ---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-10-85ab7b694cf1> in <module>()
1
2
----> 3 mat_red = manifold.SpectralEmbedding(n_components=2).fit_transform(mat)
4 clustering.fit(mat_red,y = None)

~/anaconda3/lib/python3.6/site-packages/sklearn/manifold/spectral_embedding_.py in fit_transform(self, X, y)
525 X_new : array-like, shape (n_samples, n_components)
526 """
--> 527 self.fit(X)
528 return self.embedding_

~/anaconda3/lib/python3.6/site-packages/sklearn/manifold/spectral_embedding_.py in fit(self, X, y)
498 "name or a callable. Got: %s") % self.affinity)
499
--> 500 affinity_matrix = self._get_affinity_matrix(X)
501 self.embedding_ = spectral_embedding(affinity_matrix,
502 n_components=self.n_components,

~/anaconda3/lib/python3.6/site-packages/sklearn/manifold/spectral_embedding_.py in _get_affinity_matrix(self, X, Y)
450 self.affinity_matrix_ = kneighbors_graph(X, self.n_neighbors_,
451 include_self=True,
--> 452 n_jobs=self.n_jobs)
453 # currently only symmetric affinity_matrix supported
454 self.affinity_matrix_ = 0.5 * (self.affinity_matrix_ +

~/anaconda3/lib/python3.6/site-packages/sklearn/neighbors/graph.py in kneighbors_graph(X, n_neighbors, mode, metric, p, metric_params, include_self, n_jobs)
101
102 query = _query_include_self(X, include_self)
--> 103 return X.kneighbors_graph(X=query, n_neighbors=n_neighbors, mode=mode)
104
105

~/anaconda3/lib/python3.6/site-packages/sklearn/neighbors/base.py in kneighbors_graph(self, X, n_neighbors, mode)
482 # construct CSR matrix representation of the k-NN graph
483 if mode == 'connectivity':
--> 484 A_data = np.ones(n_samples1 * n_neighbors)
485 A_ind = self.kneighbors(X, n_neighbors, return_distance=False)
486

~/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py in ones(shape, dtype, order)
186
187 """
--> 188 a = empty(shape, dtype, order)
189 multiarray.copyto(a, 1, casting='unsafe')
190 return a

MemoryError:

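I have 8 GB of RAM, and the same error occurs on a system with 64 GB. I realize that hierarchical clustering is computationally expensive and not recommended for large datasets, but I need to build a dendrogram of all the data at once. I am using ORB features to build a vocabulary tree from a bag of visual words. If there is any other way to achieve this, or a way to fix the error, please let me know. Thanks.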
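For a rough sense of scale, the arithmetic below estimates how much memory two dense structures would need for 1,841,000 rows. It is only a back-of-the-envelope sketch under stated assumptions: that values are stored as 8-byte floats, that an unconstrained hierarchical clustering keeps every pairwise distance, and that SpectralEmbedding falls back to a default of roughly n_samples / 10 neighbors when n_neighbors is not set (which I believe is the behavior of the scikit-learn version in the traceback, but treat it as an assumption).

```python
# Back-of-the-envelope memory estimate; all sizes assume float64 (8 bytes).
n_samples = 1_841_000
bytes_per_value = 8

# Assumption: an unconstrained hierarchical clustering stores one distance
# per unordered pair of samples (a "condensed" distance matrix).
n_pairs = n_samples * (n_samples - 1) // 2
pairwise_bytes = n_pairs * bytes_per_value

# Assumption: SpectralEmbedding defaults to about n_samples / 10 neighbors,
# so the k-NN graph data array holds n_samples * n_neighbors values.
default_neighbors = max(n_samples // 10, 1)
knn_values = n_samples * default_neighbors
knn_bytes = knn_values * bytes_per_value

print(f"pairwise distances: {n_pairs:,} values, ~{pairwise_bytes / 1e12:.1f} TB")
print(f"k-NN graph data:    {knn_values:,} values, ~{knn_bytes / 1e12:.1f} TB")
```

Under these assumptions both figures are in the terabyte range, which would explain why moving from 8 GB to 64 GB of RAM makes no difference.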

Best answer

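I ran into a similar problem when running agglomerative clustering. My solution was to use train_test_split to run the clustering algorithm on a small portion of the data, and then use KNN to extend the labels from AgglomerativeClustering (AC) to the rest of the data. It worked reasonably well, though I am not sure whether your data is suited to this treatment. My extension code is: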

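from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold out most of the data; only X_train is clustered.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=42)

# Cluster the small training split.
AC = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
AC.fit(X_train)
labels = AC.labels_

# Fit KNN on the clustered subset and extend its labels to every row of X.
KN = KNeighborsClassifier(n_neighbors=n_neighbors)
KN.fit(X_train, labels)
labels2 = KN.predict(X)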
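For anyone who wants to try the idea end to end, here is a minimal self-contained sketch of the same subsample-then-propagate approach. It is not the answerer's exact code: the function name `cluster_subsample_then_propagate`, the synthetic `make_blobs` data, and the parameter values are all made up for illustration, and a plain random subsample is used in place of train_test_split because no `y` is needed.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier


def cluster_subsample_then_propagate(X, n_clusters, sample_size,
                                     n_neighbors=5, random_state=42):
    """Cluster a random subsample, then label every row of X with KNN.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    rng = np.random.default_rng(random_state)
    sample_size = min(sample_size, X.shape[0])
    # Draw row indices without replacement so the subsample has no duplicates.
    idx = rng.choice(X.shape[0], size=sample_size, replace=False)
    X_sample = X[idx]

    # Ward-linkage clustering on the subsample only; memory grows with
    # sample_size squared, not with the full dataset size.
    ac = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    sample_labels = ac.fit_predict(X_sample)

    # Treat the cluster ids as class labels and propagate them to all rows.
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_sample, sample_labels)
    return knn.predict(X), idx, sample_labels


# Synthetic stand-in for the real data (assumption: 32 features, 4 groups).
X, _ = make_blobs(n_samples=20_000, n_features=32, centers=4,
                  cluster_std=1.0, random_state=0)
labels, idx, sample_labels = cluster_subsample_then_propagate(
    X, n_clusters=4, sample_size=2_000)
print(labels.shape, np.unique(labels))
```

Note that this yields flat cluster labels for all rows, while any tree structure only covers the subsample, so it may not satisfy the requirement of a dendrogram over all the data at once.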

Regarding python - Memory error during hierarchical clustering in Python 3.6, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51129498/
