
python - Scipy Sparse - distance matrix (Scikit or Scipy)

Reposted. Author: 行者123. Updated: 2023-12-02 03:35:14

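I am trying to compute nearest-neighbor clustering on a Scipy sparse matrix returned by scikit-learn's DictVectorizer. However, when I try to compute the distance matrix with scikit-learn, I get an error message for the 'euclidean' distance, both with pairwise.euclidean_distances and with pairwise.pairwise_distances. I was under the impression that scikit-learn could compute these distance matrices.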

My matrix is highly sparse, with the following shape:

<364402x223209 sparse matrix of type '<class 'numpy.float64'>'
    with 728804 stored elements in Compressed Sparse Row format>

I have also tried methods such as pdist and KDTree in Scipy, but got other errors saying the input could not be processed.

Can anyone point me to a solution that would let me efficiently compute the distance matrix and/or the nearest-neighbor results?

Some sample code:

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise
import scipy.spatial

file = 'FileLocation'
data = []
# Each input line is expected to hold three comma-separated integers.
with open(file, 'r') as f:
    for line in f:
        templine = line.strip().split(',')
        data.append({'user': str(int(templine[0])), str(int(templine[1])): int(templine[2])})

vec = DictVectorizer()
X = vec.fit_transform(data)

result = scipy.spatial.KDTree(X)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/kdtree.py", line 227, in __init__
    self.n, self.m = np.shape(self.data)
ValueError: need more than 0 values to unpack

Similarly, if I run:

scipy.spatial.distance.pdist(X,'euclidean')

I get the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 1169, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 113, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.

Finally, running NearestNeighbors in scikit-learn with the following results in a memory error:

nbrs = NearestNeighbors(n_neighbors=10, algorithm='brute')

Best answer

First, you cannot use KDTree or pdist with a sparse matrix. You have to convert it to a dense one (whether that is an option for you is your call):

>>> X
<2x3 sparse matrix of type '<type 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>

>>> scipy.spatial.KDTree(X.todense())
<scipy.spatial.kdtree.KDTree object at 0x34d1e10>
>>> scipy.spatial.distance.pdist(X.todense(),'euclidean')
array([ 6.55743852])
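
To judge whether the dense conversion is an option for the matrix in the question, a back-of-the-envelope estimate may help. The sketch below assumes float64 storage (8 bytes per value) and takes the shape from the question; the helper name `dense_size_gb` and the tiny example matrix are made up for illustration. It also checks, on that tiny matrix, that scikit-learn's `euclidean_distances` (which accepts sparse input) agrees with `pdist` applied to the dense copy.

```python
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import euclidean_distances


def dense_size_gb(n_rows, n_cols, itemsize=8):
    """Approximate size in GB (10**9 bytes) of a dense array of this shape."""
    return n_rows * n_cols * itemsize / 1e9


# Shape taken from the question; float64 storage is an assumption.
question_dense_gb = dense_size_gb(364402, 223209)

# Tiny illustrative matrix (made-up values), small enough to densify safely.
X_small = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                  [0.0, 3.0, 4.0]]))

# pdist needs a dense array and returns the condensed form,
# so squareform expands it to a full square matrix for comparison.
d_dense = squareform(pdist(X_small.toarray(), "euclidean"))

# euclidean_distances works directly on the sparse matrix.
d_sparse = euclidean_distances(X_small)
```

Under these assumptions the dense copy of the question's matrix would need roughly 650 GB, and a full square distance matrix over 364402 rows would be about 1 TB, so for data of that size it is probably more practical to keep the matrix sparse and ask only for the k nearest neighbors.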

Second, from the docs:

Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples N grows, the brute-force approach quickly becomes infeasible.

You may want to try the 'ball_tree' algorithm and see whether it can handle your data.
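
One caveat: the scikit-learn documentation for NearestNeighbors notes that fitting on sparse input overrides the `algorithm` setting and uses brute force, so 'ball_tree' may only apply after a dense conversion. If the matrix has to stay sparse, it may be worth querying the brute-force index in batches, which bounds the size of the distance block held in memory at any one time. The following is a minimal sketch under those assumptions, not a tested solution for the data in the question: the random matrix, the batch size, and the helper name `batched_kneighbors` are all made up for illustration.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors


def batched_kneighbors(X, n_neighbors=10, batch_size=1000):
    """Query brute-force nearest neighbors for the rows of X in batches."""
    nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm="brute")
    nbrs.fit(X)  # X may be a scipy CSR matrix
    dist_parts, idx_parts = [], []
    for start in range(0, X.shape[0], batch_size):
        batch = X[start:start + batch_size]  # row slice stays sparse
        dist, idx = nbrs.kneighbors(batch)
        dist_parts.append(dist)
        idx_parts.append(idx)
    return np.vstack(dist_parts), np.vstack(idx_parts)


# Small random sparse matrix standing in for the real data (illustrative only).
X_demo = sp.random(500, 2000, density=0.001, format="csr", random_state=0)
distances, indices = batched_kneighbors(X_demo, n_neighbors=10, batch_size=128)
```

Each query row is itself part of the fitted data, so its nearest neighbor is at distance zero (possibly another identical row). The result holds only `n_neighbors` entries per row, which is far smaller than a full distance matrix.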

Regarding python - Scipy Sparse - distance matrix (Scikit or Scipy), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21085990/
