
python - Pandas resample n points on a large dataframe


Suppose I have the following dataframe:

    id                time        lat       long
0    1 2020-11-01 21:48:00  66.027694  12.627349
1    1 2020-11-01 21:49:00  66.027833  12.630198
2    1 2020-11-01 21:50:00  66.027900  12.635473
3    1 2020-11-01 21:51:00  66.027967  12.640748
4    1 2020-11-01 21:52:00  66.028350  12.643367
5    1 2020-11-01 21:53:00  66.028450  12.643948
6    1 2020-11-01 21:54:00  66.028183  12.643750
7    1 2020-11-01 21:55:00  66.027767  12.643016
8    2 2020-11-01 23:30:00  66.031667  12.639148
9    2 2020-11-01 23:31:00  66.034033  12.637517
10   2 2020-11-01 23:32:00  66.036950  12.636683
11   2 2020-11-01 23:33:00  66.039742  12.636417
12   2 2020-11-01 23:34:00  66.042533  12.636150
13   2 2020-11-01 23:35:00  66.044725  12.636541
14   2 2020-11-01 23:36:00  66.046867  12.637715
15   2 2020-11-01 23:37:00  66.050550  12.641467
16   2 2020-11-01 23:38:00  66.053014  12.644047
17   2 2020-11-01 23:39:00  66.055478  12.646627
18   2 2020-11-01 23:40:00  66.057942  12.649207
19   2 2020-11-01 23:41:00  66.060406  12.651788
20   2 2020-11-01 23:42:00  66.062869  12.654368
21   2 2020-11-01 23:43:00  66.065333  12.656948
22   2 2020-11-01 23:44:00  66.067255  12.658876
23   2 2020-11-01 23:45:00  66.069177  12.660804
24   2 2020-11-01 23:46:00  66.071098  12.662732
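
For completeness, here is a minimal sketch that rebuilds a prefix of this frame (the first four rows of each id), so the snippets below can be run as-is; extend the lists with the remaining rows from the table to reproduce the outputs exactly:

import pandas as pd

# Rebuild a subset of the sample 'route' frame from the table above
route = pd.DataFrame({
    'id':   [1, 1, 1, 1, 2, 2, 2, 2],
    'time': pd.to_datetime([
        '2020-11-01 21:48:00', '2020-11-01 21:49:00',
        '2020-11-01 21:50:00', '2020-11-01 21:51:00',
        '2020-11-01 23:30:00', '2020-11-01 23:31:00',
        '2020-11-01 23:32:00', '2020-11-01 23:33:00',
    ]),
    'lat':  [66.027694, 66.027833, 66.027900, 66.027967,
             66.031667, 66.034033, 66.036950, 66.039742],
    'long': [12.627349, 12.630198, 12.635473, 12.640748,
             12.639148, 12.637517, 12.636683, 12.636417],
})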

I want to resample each group by its id so that I get 5 points, equally spaced in time, for each group.

For the example above, the result should look like this:

   id                time        lat       long
0   1 2020-11-01 21:47:15  66.027694  12.627349
1   1 2020-11-01 21:49:00  66.027867  12.632836
2   1 2020-11-01 21:50:45  66.028158  12.642057
3   1 2020-11-01 21:52:30  66.028317  12.643849
4   1 2020-11-01 21:54:15  66.027767  12.643016
5   2 2020-11-01 23:28:00  66.032850  12.638333
6   2 2020-11-01 23:32:00  66.040987  12.636448
7   2 2020-11-01 23:36:00  66.051477  12.642464
8   2 2020-11-01 23:40:00  66.061638  12.653078
9   2 2020-11-01 23:44:00  66.069177  12.660804

I have solved this and get the desired result, but it is slow: instead of 25 rows I have 10+ million.

Is there a better solution than mine?

My code is:

import pandas as pd

# Define the number of points per group
points = 5

# route is the input dataframe (see the first table above)
groups = route.groupby('id')

# 'times' holds the first and last time in each group
times = groups['time'].agg(['first', 'last']).reset_index()

# Calculate the time step that yields 5 datapoints
times['diff'] = (times['last'] - times['first']) / (points - 1)

# For saving each series of points
waypoints = []
for (name, group), (time_name, time_group) in zip(groups, times.groupby('id')):
    # Time step as a string in seconds (not the best solution)
    str_time = "{0}s".format(int(time_group['diff'].iloc[0].total_seconds()))
    # Saving points
    waypoints.append(
        group.set_index('time').groupby('id')
             .resample(str_time).mean()
             .interpolate('linear').drop('id', axis=1).reset_index()
    )

# Concatenate back into one dataframe (see the last table above)
pd_waypoints = pd.concat(waypoints).reset_index()

Best answer

Here is one way to speed this up. The idea is to replicate what resample does, which is essentially a groupby on truncated timestamps, but with a different frequency for each id, and without iterating over the groups one by one (apart from computing the frequencies):

# make a copy of the route dataframe to work on
z = route.copy()

# calculate frequency f in seconds for each id
# and t0 as the midnight of the first day of the group
g = z.groupby('id')['time']
z['f'] = (g.transform('max') - g.transform('min')).astype(int) / (points - 1) // 10**9
z['t0'] = g.transform('min').dt.floor('d').astype(int) // 10**9

# calculate seconds since t0
# this is what .resample(...) operates on
z['s_since_t0'] = z['time'].astype(int) // 10**9 - z['t0']

# get grouped seconds since t0
# in the same way that .resample(...) does
z['s_group'] = z['t0'] + z['s_since_t0'] // z['f'] * z['f']

# convert grouped seconds to datetime
z['time_group'] = pd.to_datetime(z['s_group'], unit='s')

# calculate mean
z.groupby(['id', 'time_group'])[['lat', 'long']].mean().reset_index()

Output:

   id          time_group        lat       long
0   1 2020-11-01 21:47:15  66.027694  12.627349
1   1 2020-11-01 21:49:00  66.027867  12.632835
2   1 2020-11-01 21:50:45  66.028159  12.642057
3   1 2020-11-01 21:52:30  66.028317  12.643849
4   1 2020-11-01 21:54:15  66.027767  12.643016
5   2 2020-11-01 23:28:00  66.032850  12.638332
6   2 2020-11-01 23:32:00  66.040987  12.636448
7   2 2020-11-01 23:36:00  66.051477  12.642464
8   2 2020-11-01 23:40:00  66.061638  12.653078
9   2 2020-11-01 23:44:00  66.069177  12.660804
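
For reuse in the timings below, the steps above can be wrapped into a single function; a minimal sketch (the name proposed is chosen to match the %%timeit labels, and is otherwise an assumption):

def proposed(route, points=5):
    # Same steps as above, wrapped for reuse
    z = route.copy()
    g = z.groupby('id')['time']
    # per-id bin width in seconds, and t0 as midnight of the first day
    z['f'] = (g.transform('max') - g.transform('min')).astype(int) / (points - 1) // 10**9
    z['t0'] = g.transform('min').dt.floor('d').astype(int) // 10**9
    # seconds since t0, truncated to the group's bin width
    z['s_since_t0'] = z['time'].astype(int) // 10**9 - z['t0']
    z['s_group'] = z['t0'] + z['s_since_t0'] // z['f'] * z['f']
    z['time_group'] = pd.to_datetime(z['s_group'], unit='s')
    # mean lat/long per (id, time-bin) group
    return z.groupby(['id', 'time_group'])[['lat', 'long']].mean().reset_index()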

On a 10k-row dataset, this version is about 400x faster than the original:

%%timeit
original()

3.72 s ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
proposed()

8.83 ms ± 43.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
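
For reference, here is a sketch of how such a timing comparison might be set up. The 10k-row generator below is an assumption (the benchmark data is not shown above); original() and proposed() are assumed to wrap the question's loop and the answer's steps, respectively:

import numpy as np
import pandas as pd

# Hypothetical benchmark frame: 100 ids x 100 one-minute points = 10k rows
n_ids, rows_per_id = 100, 100
bench = pd.DataFrame({
    'id': np.repeat(np.arange(n_ids), rows_per_id),
    'time': pd.Timestamp('2020-11-01') + pd.to_timedelta(
        np.tile(np.arange(rows_per_id), n_ids), unit='min'),
    'lat': np.random.uniform(66.0, 66.1, n_ids * rows_per_id),
    'long': np.random.uniform(12.6, 12.7, n_ids * rows_per_id),
})

# then, in IPython/Jupyter: %timeit proposed(bench)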

Regarding python - Pandas resample n points on a large dataframe, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66686980/
