
Python pandas: change a variable number of duplicate timestamps to unique values

Reposted. Author: 太空宇宙. Last updated: 2023-11-03 17:17:31

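This is related to an earlier question, "Python pandas change duplicate timestamp to unique", hence the similar title.

The additional requirement is to handle a variable number of duplicates per second and to space them evenly between the second boundaries, i.e.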

....
2011/1/4 9:14:00
2011/1/4 9:14:00
2011/1/4 9:14:01
2011/1/4 9:14:01
2011/1/4 9:14:01
2011/1/4 9:14:01
2011/1/4 9:14:01
2011/1/4 9:15:02
2011/1/4 9:15:02
2011/1/4 9:15:02
2011/1/4 9:15:03
....

should become:

....
2011/1/4 9:14:00
2011/1/4 9:14:00.500
2011/1/4 9:14:01
2011/1/4 9:14:01.200
2011/1/4 9:14:01.400
2011/1/4 9:14:01.600
2011/1/4 9:14:01.800
2011/1/4 9:15:02
2011/1/4 9:15:02.333
2011/1/4 9:15:02.666
2011/1/4 9:15:03
....
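
To make the spacing rule concrete: a group of n rows sharing the same second gets offsets of k/n seconds for k = 0 .. n-1. The sketch below is only an illustration of that arithmetic in plain Python; `offsets_ms` is a hypothetical helper name, and truncating the step to whole milliseconds is an assumption chosen to match the .333 / .666 values shown above.

```python
def offsets_ms(n):
    """Millisecond offsets for n rows that share one second.

    The step is truncated to whole milliseconds (assumption), so the
    k-th row of the group is shifted by k * (1000 // n) ms.
    """
    step = 1000 // n
    return [k * step for k in range(n)]


print(offsets_ms(2))  # [0, 500]
print(offsets_ms(5))  # [0, 200, 400, 600, 800]
print(offsets_ms(3))  # [0, 333, 666]
```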

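I am stuck on how to handle the variable number of duplicates.

I have been thinking along the lines of groupby(), but cannot get it right. I assume this is a common enough use case that it has already been solved.

Best answer

I converted the datetime column to timedelta values in milliseconds. The problem is that the numbers are too large, so I first shifted the timestamps back toward the epoch by 2011 - 1970 years. Then I computed the per-row offsets and added them to the base time: df['one'] = df['one'] - df['new'] + df['timedelta']. After that the millisecond values are converted back to timedeltas, and finally the 2011 - 1970 years are added back.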

import numpy as np
import pandas as pd

# sample data: duplicated timestamps with one-second resolution
df = pd.DataFrame({'time': pd.to_datetime(
    ['2011-01-04 09:14:00'] * 2 + ['2011-01-04 09:14:01'] * 5 +
    ['2011-01-04 09:15:02'] * 3 + ['2011-01-04 09:15:03'])})
print(df)
#                 time
#0 2011-01-04 09:14:00
#1 2011-01-04 09:14:00
#2 2011-01-04 09:14:01
#3 2011-01-04 09:14:01
#4 2011-01-04 09:14:01
#5 2011-01-04 09:14:01
#6 2011-01-04 09:14:01
#7 2011-01-04 09:15:02
#8 2011-01-04 09:15:02
#9 2011-01-04 09:15:02
#10 2011-01-04 09:15:03
#time datetime64[ns]

# shift back by 2011 - 1970 years so the millisecond values stay small
df['time1'] = df['time'] - pd.DateOffset(years=2011 - 1970)
# convert the shifted time to milliseconds since the epoch
df['timedelta'] = (df['time1'] - pd.Timestamp(0)) / np.timedelta64(1, 'ms')
df['one'] = 1
# per-group step: mean / sum of a column of ones gives 1 / (group size)
m = lambda x: x.mean() / x.sum()
df['one'] = df.groupby('time')['one'].transform(m)
# cast the step to integer milliseconds (truncates, e.g. 333 for three rows)
df['new'] = (df['one'] * 1000).astype(int)
# cumulative sum of the steps within each group
df['one'] = df.groupby('time')['new'].cumsum()
# subtract one step so each group starts at offset 0, then add the base time
df['one'] = df['one'] - df['new'] + df['timedelta']
# convert the milliseconds back to timedelta
df['final'] = pd.to_timedelta(df['one'], unit='ms')
# turn the timedelta into a datetime and add the removed years back
df['final'] = pd.Timestamp(0) + df['final'] + pd.DateOffset(years=2011 - 1970)
# remove the helper columns
df = df.drop(['time1', 'timedelta', 'one', 'new'], axis=1)
print(df)
# time final
#0 2011-01-04 09:14:00 2011-01-04 09:14:00.000
#1 2011-01-04 09:14:00 2011-01-04 09:14:00.500
#2 2011-01-04 09:14:01 2011-01-04 09:14:01.000
#3 2011-01-04 09:14:01 2011-01-04 09:14:01.200
#4 2011-01-04 09:14:01 2011-01-04 09:14:01.400
#5 2011-01-04 09:14:01 2011-01-04 09:14:01.600
#6 2011-01-04 09:14:01 2011-01-04 09:14:01.800
#7 2011-01-04 09:15:02 2011-01-04 09:15:02.000
#8 2011-01-04 09:15:02 2011-01-04 09:15:02.333
#9 2011-01-04 09:15:02 2011-01-04 09:15:02.666
#10 2011-01-04 09:15:03 2011-01-04 09:15:03.000
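
The same result can probably be reached more directly, without the year shift. The following is a minimal sketch of an alternative, not the method used in the answer: it numbers the rows inside each group of equal timestamps with `cumcount()`, looks up each group's size with `value_counts()`, and adds `position * (1000 // size)` milliseconds. The helper name `spread_duplicates` is hypothetical, and the sketch assumes there are no missing timestamps and that truncating the step to whole milliseconds (as in the output above) is acceptable.

```python
import pandas as pd


def spread_duplicates(times):
    """Shift the k-th of n equal timestamps by k * (1000 // n) milliseconds."""
    times = pd.Series(times).reset_index(drop=True)
    # position of each row inside its group of equal timestamps: 0, 1, 2, ...
    position = times.groupby(times).cumcount()
    # number of rows that share each timestamp, aligned to every row
    size = times.map(times.value_counts())
    # whole-millisecond step per group (assumption: truncate, do not round)
    step_ms = 1000 // size
    return times + pd.to_timedelta(position * step_ms, unit='ms')


times = pd.to_datetime(
    ['2011-01-04 09:14:00'] * 2 + ['2011-01-04 09:14:01'] * 5 +
    ['2011-01-04 09:15:02'] * 3 + ['2011-01-04 09:15:03'])
result = spread_duplicates(times)
print(result)
```

Because the offsets are built from integer milliseconds, no float rounding is involved. If more than 1000 rows share one second, the integer step becomes 0 and the rows stay duplicated, so a finer unit would be needed in that case.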

A similar question about changing a variable number of duplicate timestamps to unique values in Python pandas can be found on Stack Overflow: https://stackoverflow.com/questions/33528394/
