gpt4 book ai didi

python - 如何使用基于使用列的函数的条件选择 Python Dataframe 中的行

转载 作者:行者123 更新时间:2023-12-04 01:27:49 24 4
gpt4 key购买 nike

我有一个看起来像这样的数据框 df :

id          id_latlong
1 (46.1988400;5.209562)
2 (46.1988400;5.209562)
3 (46.1988400;5.209562)
4 (46.1988400;5.209562)
5 (46.438805;5.11890299)
6 (46.222993;5.21707600)
7 (46.195183;5.212575)
8 (46.195183;5.212575)
9 (46.195183;5.212575)
10 (48.917459;2.570821)
11 (48.917459;2.570821)

每一行都是一个位置,“id_latlong”列中的数据是坐标。

我想选择距离指定位置不到 800 米的每个位置的 ID:

defined_location_latlong = "(46.1988400;5.209562)"

我有一个函数可以计算两个坐标之间的距离(以米为单位):

def distance_btw_coordinates (id_latlong1, id_latlong2) :
try :
R = 6372800 # Earth radius in meters

lat1 = float(id_latlong1.partition('(')[2].partition(';')[0])
lon1 = float(id_latlong1.partition(';')[2].partition(')')[0])

lat2 = float(id_latlong2.partition('(')[2].partition(';')[0])
lon2 = float(id_latlong2.partition(';')[2].partition(')')[0])

phi1, phi2 = math.radians(lat1), math.radians(lat2)
dphi = math.radians(lat2 - lat1)
dlambda = math.radians(lon2 - lon1)

a = math.sin(dphi/2)**2 + \
math.cos(phi1)*math.cos(phi2)*math.sin(dlambda/2)**2

distance = 2*R*math.atan2(math.sqrt(a), math.sqrt(1 - a))
except :
distance = 1000000000

return distance

为了选择距定义位置不到 800 米的每一行,我尝试了:

df.loc[distance_btw_cohordonates(df['id_latlong'], defined_location_latlong ) < 800]

但它不起作用:

KeyError: False

它不起作用,因为该函数一次获取“id_latlong”列中的所有数据...

您知道我如何在无需迭代的情况下做到这一点吗?

谢谢!

编辑:我有 500k 个不同的定义位置,我宁愿不必存储 df 中每一行与每个定义位置之间的距离...是否可以选择小于 800 米的每个位置而不必存储距离?

最佳答案

我认为您需要通过 Series.apply 分别处理列的每个值的函数:

s = df['id_latlong'].apply(lambda x: distance_btw_coordinates(x, defined_location_latlong))
print (s)
0 1000000000
1 1000000000
2 1000000000
3 1000000000
4 1000000000
5 1000000000
6 1000000000
7 1000000000
8 1000000000
9 1000000000
10 1000000000
Name: id_latlong, dtype: int64

df.loc[s < 800]

编辑:

Is it possible to select every location that is at less than 800 meters without having to stock the distances ?

一个想法是使用向量化函数 haversine_np ,但有必要更改将字符串解析为列以及数字的代码:

def haversine_np(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)

All args must be of equal length.

"""
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

dlon = lon2 - lon1
dlat = lat2 - lat1

a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

c = 2 * np.arcsin(np.sqrt(a))
km = 6367 * c
return km

df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)
print (df)
id id_latlong lat long
0 1 (46.1988400;5.209562) 46.198840 5.209562
1 2 (46.1988400;5.209562) 46.198840 5.209562
2 3 (46.1988400;5.209562) 46.198840 5.209562
3 4 (46.1988400;5.209562) 46.198840 5.209562
4 5 (46.438805;5.11890299) 46.438805 5.118903
5 6 (46.222993;5.21707600) 46.222993 5.217076
6 7 (46.195183;5.212575) 46.195183 5.212575
7 8 (46.195183;5.212575) 46.195183 5.212575
8 9 (46.195183;5.212575) 46.195183 5.212575
9 10 (48.917459;2.570821) 48.917459 2.570821
10 11 (48.917459;2.570821) 48.917459 2.570821

lat, long = tuple(map(float, defined_location_latlong.strip('()').split(';')))
print (lat, long)
46.19884 5.209562

s = haversine_np(df['long'], df['lat'], lat, long)
print (s)
0 6016.063040
1 6016.063040
2 6016.063040
3 6016.063040
4 6037.462224
5 6017.186477
6 6015.635700
7 6015.635700
8 6015.635700
9 6353.080382
10 6353.080382
dtype: float64

#km output
df.loc[s < 0.8]

编辑1:

为了提高拆分的性能,可以使用:

#550000 rows for test
df = pd.concat([df] * 50000, ignore_index=True)

df[['lat1','long1']] = pd.DataFrame([x.strip('()').split(';') for x in df['id_latlong']], index=df.index).astype(float)
df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)

print (df)

In [38]: %timeit df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)
2.49 s ± 722 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [39]: %timeit df[['lat1','long1']] = pd.DataFrame([x.strip('()').split(';') for x in df['id_latlong']], index=df.index).astype(float)
937 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

关于python - 如何使用基于使用列的函数的条件选择 Python Dataframe 中的行,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/61554370/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com