
python - Speeding up distance and summary computation between two huge multidimensional arrays in Python

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:39:46

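I have only one year of experience with Python. I want to compute summary statistics based on two multidimensional arrays, DF_All and DF_On. Both have X and Y values. I wrote a function that computes the distance as sqrt((X-X0)^2 + (Y-Y0)^2) and generates a summary, as shown in the code below. My question: is there any way to make this code run faster? I would prefer a native Python approach, but other strategies (such as numba) are also welcome.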

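The example (toy) code below takes only 50 ms to run on my Windows 7 x64 desktop. But my real DF_All has more than 10,000 rows and I need to run this calculation many times, which adds up to a very long execution time.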

import numpy as np
import pandas as pd
import json, random

# create data
KY = ['ER','WD','DF']
DS = ['On','Off']

DF_All = pd.DataFrame({'KY': np.random.choice(KY,20,replace = True),
                       'DS': np.random.choice(DS,20,replace = True),
                       'X': random.sample(range(1,100),20),
                       'Y': random.sample(range(1,100),20)})


DF_On = DF_All[DF_All['DS']=='On']

# function
def get_values(DF_All,X = list(DF_On['X'])[0],Y = list(DF_On['Y'])[0]):
    dist_vector = np.sqrt((DF_All['X'] - X)**2 + (DF_All['Y'] - Y)**2) # computes distance

    DF_All = DF_All[dist_vector<35] # filters if distance is < 35
    # print(DF_All.shape)

    DS_summary = [sum(DF_All['DS']==x) for x in ['On','Off']] # get summary
    KY_summary = [sum(DF_All['KY']==x) for x in ['ER','WD','DF']] # get summary

    joined_summary = DS_summary + KY_summary # join two summary lists
    return joined_summary # return

Array_On = DF_On.values.tolist() # convert to array then to list
Values = [get_values(DF_All,ZZ[2],ZZ[3]) for ZZ in Array_On] # list comprehension to get DS and KY summary for all rows of Array_On list

Array_Updated = [x + y for x,y in zip(Array_On,Values)] # appending the summary list to Array_On list
Array_Updated = pd.DataFrame(Array_Updated) # converting to pandas dataframe
print(Array_Updated)

Best answer

Here's an approach that leverages vectorization by getting rid of the loop:

import numpy as np
from scipy.spatial.distance import cdist

def get_values_vectorized(DF_All, Array_On):
    a = DF_All[['X','Y']].values
    b = np.array(Array_On)[:,2:].astype(int)
    # 1 where a DF_All point is within distance 35 of an Array_On point; shape (len(b), len(a))
    v_mask = (cdist(b,a) < 35).astype(int)

    # multiplying the mask by a one-hot label matrix counts neighbours per label
    DF_DS = DF_All.DS.values
    DS_sums = v_mask.dot(DF_DS[:,None] == ['On','Off'])

    DF_KY = DF_All.KY.values
    KY_sums = v_mask.dot(DF_KY[:,None] == ['ER','WD','DF'])
    return np.column_stack(( DS_sums, KY_sums ))
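
To see why the dot product yields the per-label counts, here is a minimal sketch on four made-up points. It is an illustration under assumptions, not part of the answer: the coordinates and labels are invented, and plain NumPy broadcasting stands in for cdist so the snippet needs nothing beyond NumPy.

```python
import numpy as np

# Made-up toy data: four points with an 'On'/'Off' label each.
xy_all = np.array([[0, 0], [3, 4], [30, 40], [100, 100]], dtype=float)
labels = np.array(['On', 'Off', 'On', 'Off'])
xy_ref = xy_all[labels == 'On']  # reference points: rows 0 and 2

# Pairwise Euclidean distances via broadcasting, shape (n_ref, n_all).
# (Stand-in for cdist(xy_ref, xy_all), used here to avoid a SciPy dependency.)
diff = xy_ref[:, None, :] - xy_all[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

# 1 where a point lies within the radius of a reference point, else 0.
mask = (dist < 35).astype(int)

# One-hot label matrix, shape (n_all, 2); columns are 'On' and 'Off'.
one_hot = labels[:, None] == np.array(['On', 'Off'])

# Row i, column j: number of neighbours of reference point i carrying label j.
counts = mask.dot(one_hot)
print(counts)
```

Each row of the mask selects the neighbours of one reference point, so multiplying by the one-hot matrix adds up how many of those neighbours carry each label. Note that every reference point counts itself, since its distance to itself is 0.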

A tweaked version with a smaller memory footprint:

def get_values_vectorized_v2(DF_All, Array_On):
    a = DF_All[['X','Y']].values
    b = np.array(Array_On)[:,2:].astype(int)
    # boolean mask kept as bool (no int conversion); shape (len(a), len(b))
    v_mask = cdist(a,b) < 35

    # for each label, AND the label indicator with the mask and count down the columns
    DF_DS = DF_All.DS.values
    DS_sums = [((DF_DS==x)[:,None] & v_mask).sum(0) for x in ['On','Off']]

    DF_KY = DF_All.KY.values
    KY_sums = [((DF_KY==x)[:,None] & v_mask).sum(0) for x in ['ER','WD','DF']]

    out = np.column_stack(( np.column_stack(DS_sums), np.column_stack(KY_sums)))
    return out
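
If even the boolean mask is too large to hold at once (it has len(DF_All) × len(Array_On) entries), one option is to process the reference points in blocks. The sketch below is a variation, not something the answer proposes: the function name `count_neighbors_chunked` and the `chunk_size` parameter are hypothetical, it works on plain NumPy arrays rather than the DataFrame, and it uses broadcasting on squared distances in place of cdist.

```python
import numpy as np

def count_neighbors_chunked(xy_all, labels, xy_ref, categories,
                            radius=35, chunk_size=256):
    """For each reference point, count points within `radius` per category.

    Hypothetical helper: processes `xy_ref` in blocks of `chunk_size` rows so
    that only a (chunk_size, len(xy_all)) distance block exists at a time.
    """
    # One-hot label matrix, shape (n_all, n_categories).
    one_hot = (labels[:, None] == np.asarray(categories)).astype(np.int64)
    out = np.empty((len(xy_ref), len(categories)), dtype=np.int64)
    for start in range(0, len(xy_ref), chunk_size):
        block = xy_ref[start:start + chunk_size]
        # Squared distances for this block only; comparing against radius**2
        # is equivalent to comparing distances and skips the square root.
        d2 = ((block[:, None, :] - xy_all[None, :, :]) ** 2).sum(axis=2)
        within = (d2 < radius ** 2).astype(np.int64)
        out[start:start + chunk_size] = within.dot(one_hot)
    return out

# Random stand-in data shaped like the question's X, Y and DS columns.
rng = np.random.default_rng(0)
xy_all = rng.integers(1, 100, size=(500, 2)).astype(float)
labels = rng.choice(['On', 'Off'], size=500)
xy_ref = xy_all[labels == 'On']

counts = count_neighbors_chunked(xy_all, labels, xy_ref, ['On', 'Off'],
                                 chunk_size=64)
print(counts[:5])
```

The block size trades memory for loop overhead: larger blocks mean fewer Python-level iterations, while smaller blocks keep the temporary arrays small.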

Runtime test:

Case #1: original sample size of 20

In [417]: %timeit [get_values(DF_All,ZZ[2],ZZ[3]) for ZZ in Array_On]
100 loops, best of 3: 16.3 ms per loop

In [418]: %timeit get_values_vectorized(DF_All, Array_On)
1000 loops, best of 3: 386 µs per loop

Case #2: sample size of 2000

In [420]: %timeit [get_values(DF_All,ZZ[2],ZZ[3]) for ZZ in Array_On]
1 loops, best of 3: 1.39 s per loop

In [421]: %timeit get_values_vectorized(DF_All, Array_On)
100 loops, best of 3: 18 ms per loop
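
The timings above come from IPython's %timeit. To run a comparable measurement in a plain Python script, the standard-library timeit module can be used. The following is only a sketch under assumptions: the data is random, the two functions are simplified NumPy-only stand-ins (they count all neighbours within the radius, without the per-label breakdown), and absolute numbers will differ from those reported above.

```python
import timeit
import numpy as np

# Random stand-in data, same size as case #2.
rng = np.random.default_rng(0)
n = 2000
xy = rng.integers(1, 100, size=(n, 2)).astype(float)
is_on = rng.random(n) < 0.5
ref = xy[is_on]  # reference points, analogous to DF_On

def loop_counts():
    # One pass over all points per reference point, like the original loop.
    return [int((np.sqrt(((xy - p) ** 2).sum(axis=1)) < 35).sum()) for p in ref]

def vectorized_counts():
    # All pairwise distances at once via broadcasting.
    d = np.sqrt(((ref[:, None, :] - xy[None, :, :]) ** 2).sum(axis=2))
    return (d < 35).sum(axis=1).tolist()

# Take the best of 3 repeats, mirroring the "best of 3" that %timeit reports.
t_loop = min(timeit.repeat(loop_counts, number=1, repeat=3))
t_vec = min(timeit.repeat(vectorized_counts, number=1, repeat=3))
print(f"loop: {t_loop * 1e3:.1f} ms, vectorized: {t_vec * 1e3:.1f} ms")
```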

Regarding python - speeding up distance and summary computation between two huge multidimensional arrays in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42257725/
