gpt4 book ai didi

python - 使用自定义指标 : pairwise_distances throwing error 的 sklearn 聚类

转载 作者:行者123 更新时间:2023-11-28 18:57:12 24 4
gpt4 key购买 nike

我想使用我自己的指标对空间数据集进行聚类。数据以数据帧中的成对 (x,y) 值的形式出现,其中每组成对都有一个 id。就像在下面的示例中,我有三组点:

import pandas as pd 
import numpy as np

df = pd.DataFrame({'id': [1] * 4 + [2] * 5 + [3] * 3,
'x': np.random.random(12),
'y': np.random.random(12)})
df['xy'] = df[['x','y']].apply(lambda row: [row['x'],row['y']], axis = 1)

这是我想使用的距离函数:

from scipy.spatial.distance import directed_hausdorff
def some_distance(u, v):
return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])

此函数计算 Hausdorff distance ,即 n 维空间的两个子集 uv 之间的距离。在我的例子中,我想使用这个距离函数来聚类真实平面的子集。在上面的数据中有三个这样的子集(ids 从 1 到 3)所以生成的距离矩阵应该是 3x3。

我对聚类步骤的想法是使用 sklearn.cluster.AgglomerativeClustering 和一个预先计算的指标,反过来我想用 sklearn.metrics.pairwise import pairwise_distances

from sklearn.metrics.pairwise import pairwise_distances
def to_np_array(col):
return np.array(list(col.values))
X = df.groupby('id')['xy'].apply(to_np_array).as_matrix()
m = pairwise_distances(X, X, metric=some_distance)

但是,最后一行给我一个错误:

ValueError: setting an array element with a sequence.

然而,工作正常的是调用 some_distance(X[1], X[2])。我的预感是 X 需要采用不同的格式才能使 pairwise_distances 正常工作。关于如何使这项工作或如何自己计算矩阵以便我可以将其粘贴到 sklearn.cluster.AgglomerativeClustering 中的任何想法?

错误栈是

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-e34155622595> in <module>
12 def some_distance(u, v):
13 return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
---> 14 m = pairwise_distances(X, X, metric=some_distance)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1430 func = partial(distance.cdist, metric=metric, **kwds)
1431
-> 1432 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1433
1434

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1065
1066 if effective_n_jobs(n_jobs) == 1:
-> 1067 return func(X, Y, **kwds)
1068
1069 # TODO: in some cases, backend='threading' may be appropriate

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in _pairwise_callable(X, Y, metric, **kwds)
1079 """Handle the callable case for pairwise_{distances,kernels}
1080 """
-> 1081 X, Y = check_pairwise_arrays(X, Y)
1082
1083 if X is Y:

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
106 if Y is X or Y is None:
107 X = Y = check_array(X, accept_sparse='csr', dtype=dtype,
--> 108 warn_on_dtype=warn_on_dtype, estimator=estimator)
109 else:
110 X = check_array(X, accept_sparse='csr', dtype=dtype,

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
525 try:
526 warnings.simplefilter('error', ComplexWarning)
--> 527 array = np.asarray(array, dtype=dtype, order=order)
528 except ComplexWarning:
529 raise ValueError("Complex data not supported\n"

C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540

ValueError: setting an array element with a sequence.

最佳答案

试试这个:

import numpy as np
import pandas as pd
from scipy.spatial.distance import directed_hausdorff
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import AgglomerativeClustering

df = pd.DataFrame({'id': [1] * 4 + [2] * 5 + [3] * 3, 'x':
np.random.random(12), 'y': np.random.random(12)})
df['xy'] = df[['x','y']].apply(lambda row: [row['x'],row['y']], axis = 1)
df.groupby('id')['xy'].apply(to_np_array)


def some_distance(u, v):
return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])


def to_np_array(col):
return np.array(list(col.values))


X = df.groupby('id')['xy'].apply(to_np_array)
d = np.zeros((len(X),len(X)))

for i, u in enumerate(X):
for j, v in list(enumerate(X))[i:]:
d[i,j] = some_distance(u,v)
d[j,i] = d[i,j]

现在,当您打印 d 时,您会得到:

array([[0.        , 0.58928274, 0.40767213],
[0.58928274, 0. , 0.510095 ],
[0.40767213, 0.510095 , 0. ]])

对于聚类:

cluster = AgglomerativeClustering(n_clusters=2, affinity='precomputed', linkage = 'average')
cluster.fit(d)

关于python - 使用自定义指标 : pairwise_distances throwing error 的 sklearn 聚类,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/56913478/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com