
python - For a large list, how to quickly match whether a point lies between 2 points

Reprinted · Author: 太空狗 · Updated: 2023-10-30 02:53:20

I have a dictionary holding information about individual positions, position_info, and a list holding information about features, feature_info. I have to find which feature(s) each position falls in (there can be more than one) so that I can annotate the positions. This is what I currently use:

feature_info = [[1, 10, 'a'],[15, 30, 'b'],[40, 60, 'c'],[55, 71, 'd'],[73, 84, 'e']]
position_info = {5:'some info', 16:'some other info', 75:'last info'}
for pos in position_info.keys():
    for info in feature_info:
        if info[0] <= pos < info[1]:
            print(pos, position_info[pos], info[2])

The problem is that feature_info contains 800k+ features and position_info 150k positions, which makes this slow. I could optimize it a bit myself, but there is probably an existing approach that does this better than anything I would come up with; I just have not found it.

Edit

For example, this is one way I can think of to speed it up:

for info in feature_info:
    for pos in position_info.keys():
        if info[0] <= pos < info[1]:
            print(pos, position_info[pos], info[2])
        if pos > info[1]:
            break

If I sort the positions, I can break out of the loop as soon as a position is larger than the feature's end position (provided I make sure the features are sorted as well). Still, there must be a better way to do this.
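The early-break idea described above can be sketched as follows (a minimal sketch, not code from the original post): sort the positions once, sort the features by start, and for each feature use bisect to jump to the first candidate position, stopping the scan at the feature's end. Because every feature scans its own slice of the positions, this also tolerates overlapping features.

```python
from bisect import bisect_left

# Sample data copied from the question above.
feature_info = [[1, 10, 'a'], [15, 30, 'b'], [40, 60, 'c'], [55, 71, 'd'], [73, 84, 'e']]
position_info = {5: 'some info', 16: 'some other info', 75: 'last info'}

positions = sorted(position_info)          # sort the query positions once
feature_info.sort(key=lambda f: f[0])      # sort features by start

matches = []
for start, end, name in feature_info:
    i = bisect_left(positions, start)      # first position >= start
    while i < len(positions) and positions[i] < end:
        matches.append((positions[i], position_info[positions[i]], name))
        i += 1
```

Each feature now costs one O(log m) search plus one step per actual hit, instead of a full pass over all positions.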

What is the fastest way to implement this?

Comparison of the 3 answers

import timeit

setup = """
from bisect import bisect
import pandas as pd
import random
import numpy as np

position_info = {}

random_number = random.sample(range(9000), 8000)
random_feature_start = random.sample(range(90000), 5000)
random_feature_length = np.random.choice(1000, 5000, replace=True)

for i in random_number:
    position_info[i] = 'test'

feature_info = []
for index, i in enumerate(random_feature_start):
    feature_info.append([i, i+random_feature_length[index],'test'])

"""

p1 = """
sections = sorted(r for a, b, c in feature_info for r in (a,b))
for pos in position_info:
    feature_info[int(bisect(sections, pos) / 2)]
"""

p2 = """
# feature info to dataframe
feature_df = pd.DataFrame(feature_info)

# rename feature df columns
feature_df.rename(index=str, columns={0: "start", 1: "end",2:'name'}, inplace=True)

# positions to dataframe
position_df = pd.DataFrame.from_dict(position_info, orient='index')
position_df['key'] = position_df.index

# merge dataframes
feature_df['merge'] = 1
position_df['merge'] = 1
merge_df = feature_df.merge(position_df, on='merge')
merge_df.drop(['merge'], inplace=True, axis=1)

# filter where key between start and end
merge_df = merge_df.loc[(merge_df.key > merge_df.start) & (merge_df.key < merge_df.end)]
"""

p3 = """
feature_df = pd.DataFrame(feature_info)
position_df = pd.DataFrame(position_info, index=[0])
hits = position_df.apply(lambda col: (feature_df[0] <= col.name) & (col.name < feature_df[1])).values.nonzero()
for f, p in zip(*hits):
    position_info[position_df.columns[p]]
    feature_info[f]
"""

print('bisect:',timeit.timeit(p1, setup=setup, number = 3))
print('panda method 1:',timeit.timeit(p2, setup=setup, number = 3))
print('panda method 2:',timeit.timeit(p3, setup=setup, number = 3))

bisect: 0.08317881799985116
pandas method 1: 29.6151025639997
pandas method 2: 16.90901438500032

However, the bisect method only works when there are no overlapping features; for example

feature_info = [[1, 10, 'a'],[15, 30, 'b'],[40, 60, 'c'],[55, 71, 'd'],[2, 8, 'a_new']]

does not work with it, whereas it does work with the pandas solutions.
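For completeness, overlapping features can also be handled without pandas by a sweep line (an additional sketch, not one of the three answers benchmarked above): merge feature starts, feature ends, and query positions into one sorted event stream and track the set of currently open features. The example below uses a hypothetical query point 58 to show a position inside the overlapping features 'c' and 'd':

```python
# Event kinds are ordered so that at the same coordinate a feature opens
# (start is inclusive) and closes (end is exclusive) before any query
# point at that coordinate is answered.
feature_info = [[1, 10, 'a'], [15, 30, 'b'], [40, 60, 'c'], [55, 71, 'd'], [2, 8, 'a_new']]
position_info = {5: 'some info', 16: 'some other info', 58: 'last info'}

OPEN, CLOSE, QUERY = 0, 1, 2
events = []
for start, end, name in feature_info:
    events.append((start, OPEN, name))
    events.append((end, CLOSE, name))
for pos in position_info:
    events.append((pos, QUERY, pos))

open_features = set()
matches = []
for coord, kind, payload in sorted(events):
    if kind == OPEN:
        open_features.add(payload)
    elif kind == CLOSE:
        open_features.discard(payload)
    else:  # QUERY: every currently open feature contains this position
        for name in sorted(open_features):
            matches.append((payload, position_info[payload], name))
```

This runs in O((n + m) log(n + m)) plus the size of the output, regardless of how much the features overlap.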

Best answer

The fastest way is probably to use a fast library: pandas. pandas vectorizes your operation, which makes it fast.

feature_df = pd.DataFrame(feature_info)
position_df = pd.DataFrame(position_info, index=[0])
hits = position_df.apply(lambda col: (feature_df[0] <= col.name) & (col.name < feature_df[1])).values.nonzero()
for f, p in zip(*hits):
    print(position_info[position_df.columns[p]], "between", feature_info[f])

Regarding "python - For a large list, how to quickly match whether a point lies between 2 points", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49923695/
