
python - Pandas group by 2 columns, find delta using another column

Reposted. Author: 行者123. Updated: 2023-12-01 08:20:17

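I have a pandas DataFrame with 4909144 rows. It uses time as the index and has the columns source_name, dest_address, and tvalue, where tvalue holds the same values as the time index. I sorted the df by source_name, dest_address, and tvalue with the following command, so that the rows are grouped and in time order within each group: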

df = df.sort_values(by=['sourcehostname','destinationaddress','tvalue'])

This gives me:

                        source_name  dest_address   tvalue                 
time
2019-02-06 15:00:54.000 source_1 72.21.215.90 2019-02-06 15:00:54.000
2019-02-06 15:01:00.000 source_1 72.21.215.90 2019-02-06 15:01:00.000
2019-02-06 15:30:51.000 source_1 72.21.215.90 2019-02-06 15:30:51.000
2019-02-06 15:30:51.000 source_1 72.21.215.90 2019-02-06 15:30:51.000
2019-02-06 15:00:54.000 source_1 131.107.0.89 2019-02-06 15:00:54.000
2019-02-06 15:01:14.000 source_1 131.107.0.89 2019-02-06 15:01:14.000
2019-02-06 15:03:02.000 source_2 69.63.191.1 2019-02-06 15:03:02.000
2019-02-06 15:08:02.000 source_2 69.63.191.1 2019-02-06 15:08:02.000

I want the difference between consecutive times, so I use:

#Create delta
df['delta'] = (df['tvalue']-df['tvalue'].shift()).fillna(0)

This gives me:

                        source_name  dest_address   tvalue                 delta
time
2019-02-06 15:00:54.000 source_1 72.21.215.90 2019-02-06 15:00:54.000 00:00:00
2019-02-06 15:01:00.000 source_1 72.21.215.90 2019-02-06 15:01:00.000 00:00:06
2019-02-06 15:30:51.000 source_1 72.21.215.90 2019-02-06 15:30:51.000 00:29:51
2019-02-06 15:30:51.000 source_1 72.21.215.90 2019-02-06 15:30:51.000 00:00:00
2019-02-06 15:00:54.000 source_1 131.107.0.89 2019-02-06 15:00:54.000 -1 days +23:30:03
2019-02-06 15:01:14.000 source_1 131.107.0.89 2019-02-06 15:01:14.000 00:00:20
2019-02-06 15:03:02.000 source_2 69.63.191.1 2019-02-06 15:03:02.000 00:01:48
2019-02-06 15:08:02.000 source_2 69.63.191.1 2019-02-06 15:08:02.000 00:05:00

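However, I want to group by source_name and dest_address and take the difference in tvalue within each group, so that I do not get a delta of -1 days +23:30:03 at a group boundary, and so that the first source_2 entry has a delta of 00:00:00 instead of 00:01:48.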

I am trying:

df.groupby(['sourcehostname','destinationaddress'])['tvalue'].diff().fillna(0)

But this takes a very, very long time, and it may not give me the result I want.
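On the eight sample rows shown above, a grouped `diff()` does appear to produce the desired deltas. The following is a minimal sketch, not a statement about performance on 4.9 million rows. It assumes the columns are really named `sourcehostname` and `destinationaddress` (as in the question's code, although the printed output shows `source_name` and `dest_address`), that `tvalue` is already a datetime column, and it fills the gaps with `pd.Timedelta(0)` instead of the integer `0`.

```python
import pandas as pd

# Sample rows copied from the question. Column names follow the question's code,
# which is an assumption: the printed output uses source_name / dest_address.
rows = [
    ("source_1", "72.21.215.90", "2019-02-06 15:00:54"),
    ("source_1", "72.21.215.90", "2019-02-06 15:01:00"),
    ("source_1", "72.21.215.90", "2019-02-06 15:30:51"),
    ("source_1", "72.21.215.90", "2019-02-06 15:30:51"),
    ("source_1", "131.107.0.89", "2019-02-06 15:00:54"),
    ("source_1", "131.107.0.89", "2019-02-06 15:01:14"),
    ("source_2", "69.63.191.1", "2019-02-06 15:03:02"),
    ("source_2", "69.63.191.1", "2019-02-06 15:08:02"),
]
df = pd.DataFrame(rows, columns=["sourcehostname", "destinationaddress", "tvalue"])
df["tvalue"] = pd.to_datetime(df["tvalue"])

# diff() is computed inside each (source, destination) group, so the first row
# of every group is NaT. Those gaps are filled with a zero-length Timedelta.
# sort=False keeps the groups in their existing order (the frame is pre-sorted).
df["delta"] = (
    df.groupby(["sourcehostname", "destinationaddress"], sort=False)["tvalue"]
    .diff()
    .fillna(pd.Timedelta(0))
)
print(df)
```

The first row of each group comes out as a zero delta, and no negative value appears where one group ends and the next begins.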

The following does not work, but is it possible to do something like my original code with a group-by added?

#Create delta
df['delta'] = df.groupby(['sourcehostname','destinationaddress'])(df['tvalue']-df['tvalue'].shift()).fillna(0)

Best answer

import datetime as dt

source_changed = df['sourcehostname'] != df['sourcehostname'].shift()
dest_changed = df['destinationaddress'] != df['destinationaddress'].shift()
change_occurred = (source_changed | dest_changed)

time_diff = df['tvalue'].diff()

# A zero-length timedelta, used for the first row of each group
zero_delta = dt.timedelta(0)

df['time_diff'] = time_diff
df['change_occurred'] = change_occurred

# Where a new (source, destination) group starts, set delta to zero_delta;
# otherwise keep the row-to-row difference stored in df['time_diff'].
df['delta'] = df['time_diff'].mask(df['change_occurred'], zero_delta)
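The boundary-flag idea in this answer can be condensed into a short, self-contained sketch and checked against a grouped `diff()`. It relies on the frame already being sorted so that each (source, destination) group is contiguous, as in the question. The sample data is copied from the question, and `keys` is a name introduced here only for illustration.

```python
import pandas as pd

# Sample data from the question, already sorted so each group is contiguous.
df = pd.DataFrame(
    {
        "sourcehostname": ["source_1"] * 6 + ["source_2"] * 2,
        "destinationaddress": ["72.21.215.90"] * 4
        + ["131.107.0.89"] * 2
        + ["69.63.191.1"] * 2,
        "tvalue": pd.to_datetime(
            [
                "2019-02-06 15:00:54",
                "2019-02-06 15:01:00",
                "2019-02-06 15:30:51",
                "2019-02-06 15:30:51",
                "2019-02-06 15:00:54",
                "2019-02-06 15:01:14",
                "2019-02-06 15:03:02",
                "2019-02-06 15:08:02",
            ]
        ),
    }
)

keys = ["sourcehostname", "destinationaddress"]  # hypothetical helper name

# True on the first row of each contiguous (source, destination) run,
# including the very first row, where shift() yields a missing value.
change_occurred = (df[keys] != df[keys].shift()).any(axis=1)

# Plain row-to-row difference, reset to zero wherever a new group starts.
df["delta"] = df["tvalue"].diff().mask(change_occurred, pd.Timedelta(0))

# Cross-check against the grouped diff on the same data.
expected = df.groupby(keys, sort=False)["tvalue"].diff().fillna(pd.Timedelta(0))
print(df["delta"].equals(expected))
```

Because this version uses only whole-column comparisons and a single `diff()`, it avoids per-group work. If the rows are not sorted by the two key columns first, the boundary flags no longer correspond to groups, and the result will differ from the grouped version.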

Regarding "python - Pandas group by 2 columns, find delta using another column", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/54693349/
