python and Pandas: iterating over DataFrame twice

I'm doing a Mahalanobis calculation for each row of a DataFrame, computing the distance to every other row in the DataFrame. It looks like this:

import pandas as pd
from scipy import linalg
from scipy.spatial.distance import mahalanobis
from pprint import pprint

testa = { 'pid': 'testa', 'a': 25, 'b': .455, 'c': .375 }
testb = { 'pid': 'testb', 'a': 22, 'b': .422, 'c': .402 }
testc = { 'pid': 'testc', 'a': 11, 'b': .389, 'c': .391 }

cats = ['a','b','c']
pids = pd.DataFrame([ testa, testb, testc ])
inverse = linalg.inv(pids[cats].cov().values)
distances = { pid: {} for pid in pids['pid'].tolist() }

# squared Mahalanobis distance from each row to every other row
for i, p in pids.iterrows():
    pid = p['pid']
    others = pids.loc[pids['pid'] != pid]
    for x, other in others.iterrows():
        otherpid = other['pid']
        d = mahalanobis(p[cats], other[cats], inverse) ** 2
        distances[pid][otherpid] = d

pprint(distances)

It works fine for the three test cases here, but in real life it will be run against roughly 2000-3000 rows, and with this approach that takes far too long. I'm fairly new to pandas, and I really prefer Python over R, so I'd like to clean this up.

How can I make this more efficient?

Best Answer

Doing a mahalanobis calculation for each row of a DataFrame with distances to every other row in the DataFrame.

This is essentially solved by sklearn.metrics.pairwise.pairwise_distances, so it is doubtful whether doing it by hand can be made any more efficient. In that case, how about:

from sklearn import metrics

>>> metrics.pairwise.pairwise_distances(
        pids[['a', 'b', 'c']].as_matrix(),
        metric='mahalanobis')
array([[ 0.        ,  2.15290501,  3.54499647],
       [ 2.15290501,  0.        ,  2.62516666],
       [ 3.54499647,  2.62516666,  0.        ]])
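
If you need the same pid-keyed dictionary of squared distances that the original loop produced, the distance matrix can be unpacked afterwards. Below is a minimal sketch, assuming the test data from the question and a recent pandas where .to_numpy() replaces the since-removed .as_matrix(); the nested-dict rebuild is an illustration, not part of the original answer.

import pandas as pd
from sklearn import metrics

# same test data as in the question
pids = pd.DataFrame([
    {'pid': 'testa', 'a': 25, 'b': .455, 'c': .375},
    {'pid': 'testb', 'a': 22, 'b': .422, 'c': .402},
    {'pid': 'testc', 'a': 11, 'b': .389, 'c': .391},
])
cats = ['a', 'b', 'c']
labels = pids['pid'].tolist()

# full pairwise Mahalanobis distance matrix in one vectorized call
dist = metrics.pairwise.pairwise_distances(pids[cats].to_numpy(),
                                           metric='mahalanobis')

# rebuild the nested dict, squaring to match the original d = mahalanobis(...) ** 2
distances = {a: {b: dist[i, j] ** 2
                 for j, b in enumerate(labels) if j != i}
             for i, a in enumerate(labels)}

Because the whole matrix is computed in one call, the double iterrows loop disappears entirely, which is where the speedup for a few thousand rows should come from.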

Regarding python and Pandas: iterating over DataFrame twice, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36207560/
