
python - Improving the performance of Python code aimed at finding numbers within specific intervals in a large dataset


I have a dictionary like the following (only one key is shown here for simplicity):

intervals={'Sca1': [[1428, 1876, 0.0126525], [1876, 1883, 0.0126525], [1883, 1939, 0.0126525], [1939, 1956, 0.0126525], [1956, 2032, 0.0126525], [2154, 3067, 0.0126525], [3067, 3438, 0.0126525], [3438, 3575, 0.0126525], [4301, 4610, 0.0126525], [4610, 4694, 0.0126525], [4694, 5163, 0.0126525], [5163, 5164, 0.0126525], [5164, 5530, 0.013], [5530, 5858, 0.0127005]]}

and the following list:

snplist = [1786, 2463, 2907, 3068, 3086, 3398, 5468, 5531, 5564, 5580]

For each value in snplist, I want to check whether it lies in the interval formed by the first two values of any sublist among the dictionary's values. For example, 1786 lies between 1428 and 1876, the first two values of [1428, 1876, 0.0126525]. If it does, I print the index of that sublist (0 in this case), the element of snplist (here 1786), and the third value of the sublist (here 0.0126525). I wrote the following code:

output = []
for element in snplist:
    for key, value in intervals.items():
        for left, right, rho in value:
            if left <= element <= right:
                output.append([value.index([left, right, rho]), element, rho])
print('output', output, '\n')

The output is:

[[0, 1786, 0.0126525], [5, 2463, 0.0126525], [5, 2907, 0.0126525], [6, 3068, 0.0126525], [6, 3086, 0.0126525], [6, 3398, 0.0126525], [12, 5468, 0.013], [13, 5531, 0.0127005], [13, 5564, 0.0127005], [13, 5580, 0.0127005]]

This code works fine on this small dataset, but it becomes very slow when I use it on a very large one. I also tried a list comprehension:

output = [[value.index([left, right, rho]), element, rho]
          for element in snplist
          for key, value in intervals.items()
          for left, right, rho in value
          if left <= element <= right]

but that brought no improvement. Are there any suggestions for speeding the code up, for example by reducing the number of for loops? Thanks!

Best Answer

You can speed this up if you convert the dict values to numpy arrays:

Data:

import numpy as np

intervals_numpy = {'Sca1': np.array([[1428, 1876, 0.0126525], [1876, 1883, 0.0126525], [1883, 1939, 0.0126525], [1939, 1956, 0.0126525], [1956, 2032, 0.0126525], [2154, 3067, 0.0126525], [3067, 3438, 0.0126525], [3438, 3575, 0.0126525], [4301, 4610, 0.0126525], [4610, 4694, 0.0126525], [4694, 5163, 0.0126525], [5163, 5164, 0.0126525], [5164, 5530, 0.013], [5530, 5858, 0.0127005]])}

intervals_list = {'Sca1': [[1428, 1876, 0.0126525], [1876, 1883, 0.0126525], [1883, 1939, 0.0126525], [1939, 1956, 0.0126525], [1956, 2032, 0.0126525], [2154, 3067, 0.0126525], [3067, 3438, 0.0126525], [3438, 3575, 0.0126525], [4301, 4610, 0.0126525], [4610, 4694, 0.0126525], [4694, 5163, 0.0126525], [5163, 5164, 0.0126525], [5164, 5530, 0.013], [5530, 5858, 0.0127005]]}

snplist = [1786, 2463, 2907, 3068, 3086, 3398, 5468, 5531, 5564, 5580]

Functions:

def foo(intervals, snplist):
    output = []
    for n in snplist:
        for key, value in intervals.items():
            # Vectorized, inclusive two-sided test matching the original
            # left <= element <= right check in bar.
            for idx in np.where(np.logical_and(value[:, 0] <= n, n <= value[:, 1]))[0]:
                output.append([idx, n, value[idx][2]])
    return output

def bar(intervals, snplist):
    output = []
    for element in snplist:
        for key, value in intervals.items():
            for left, right, rho in value:
                if left <= element <= right:
                    output.append([value.index([left, right, rho]), element, rho])
    return output

With this setup, bar is about three times as fast as foo for me:

%timeit bar(intervals_list, snplist)
The slowest run took 6.22 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 13.5 µs per loop

%timeit foo(intervals_numpy, snplist)
The slowest run took 5.99 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 39.8 µs per loop

But numpy pays off for large arrays! With this setup it is roughly 500 times faster:

intervals_numpy['Sca1'] = np.repeat(intervals_numpy['Sca1'], 1000, axis=0)
intervals_list['Sca1'] = intervals_numpy['Sca1'].tolist()

%timeit bar(intervals_list, snplist)
1 loops, best of 3: 2.05 s per loop

%timeit foo(intervals_numpy, snplist)
100 loops, best of 3: 4.04 ms per loop
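If you can additionally rely on the intervals in each array being sorted by left endpoint and non-overlapping, as they are in the example data, a binary search per element with np.searchsorted is a further option. This is a sketch of mine under that assumption, not part of the timed answer above; note that when an element sits exactly on an endpoint shared by two touching intervals, it reports only the later interval, unlike the inclusive two-sided test.

```python
import numpy as np

def interval_lookup(intervals, snplist):
    # Assumes each value's rows are sorted by left endpoint and the
    # intervals do not overlap, so one binary search per element suffices.
    output = []
    for value in intervals.values():
        lefts = value[:, 0]
        for element in snplist:
            # Index of the last interval whose left endpoint is <= element.
            idx = np.searchsorted(lefts, element, side='right') - 1
            if idx >= 0 and element <= value[idx, 1]:
                output.append([idx, element, value[idx, 2]])
    return output

data = {'Sca1': np.array([[1428, 1876, 0.0126525],
                          [2154, 3067, 0.0126525],
                          [5164, 5530, 0.013]])}
print(interval_lookup(data, [1786, 2463, 5468]))
```

This turns the per-element cost from a scan of every row into a single O(log n) lookup, which matters once the interval table itself is large.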

Most of that huge speed difference comes from your index lookup; see Martin Evans's answer. But for me the numpy version is still a little faster.
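To see the cost of that index lookup in isolation, here is a pure-Python variant (a sketch, not from the original answer) that keeps the row index with enumerate instead of re-searching the list for each match, as value.index(...) does:

```python
# enumerate() yields the row index directly, avoiding the O(n) list
# search that value.index([left, right, rho]) performs for every match.
def baz(intervals, snplist):
    output = []
    for element in snplist:
        for value in intervals.values():
            for idx, (left, right, rho) in enumerate(value):
                if left <= element <= right:
                    output.append([idx, element, rho])
    return output

intervals = {'Sca1': [[1428, 1876, 0.0126525], [1876, 1883, 0.0126525],
                      [5164, 5530, 0.013], [5530, 5858, 0.0127005]]}
print(baz(intervals, [1786, 5468, 5531]))
# -> [[0, 1786, 0.0126525], [2, 5468, 0.013], [3, 5531, 0.0127005]]
```

Per matching element this replaces a linear search with a value the loop already has, while keeping the same inclusive interval test as the original code.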

A similar question about improving the performance of Python code that finds numbers within specific intervals in a large dataset can be found on Stack Overflow: https://stackoverflow.com/questions/34792534/
