gpt4 book ai didi

python - GroupBy - Datetime diff() 结合附加条件

转载 作者:行者123 更新时间:2023-11-28 18:31:13 25 4
gpt4 key购买 nike

我有一个如下所示的数据框:

In [265]: df_2
Out[265]:
A ID DATETIME ORDER_FAILED
0 B-028 b76cd912ff 2014-10-08 13:43:27 True
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False
2 B-076 1a682034f8 2014-10-08 14:29:01 False
3 B-023 b76cd912ff 2014-10-08 18:39:34 True
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True
5 B-025 b76cd912ff 2014-10-08 18:42:02 True
6 B-026 b76cd912ff 2014-10-08 18:42:41 False
7 B-033 b76cd912ff 2014-10-08 18:44:30 True
8 B-032 b76cd912ff 2014-10-08 18:46:00 True
9 B-037 b76cd912ff 2014-10-08 18:52:15 True
10 B-046 db959faf02 2014-10-08 18:59:59 False
11 B-053 b76cd912ff 2014-10-08 19:17:48 True
12 B-065 b76cd912ff 2014-10-08 19:21:38 False

我需要在任何失败的订单序列中删除所有重复的“失败订单”(最后一个订单除外)。

A 'sequence' is a series of failed orders that meet the following criteria:

  1. Placed by the same user - identified by 'ID'
  2. Has 'ORDER_FAILED' == True
  3. No consecutive orders are more than 5 minutes away from each other.

我希望这可以像这样完成:

In [298]: df_2[df_2.ORDER_FAILED == True].sort_values(by='DATETIME').groupby('ID')['DATETIME'].diff().dt.total_seconds()
Out[298]:
0 NaN
3 17767.0
4 NaN
5 148.0
7 148.0
8 90.0
9 375.0
11 1533.0
Name: DATETIME, dtype: float64

然后使用 pd.join 达到此目的:

In [302]: df_2 = df_2.join(df_tmp); df_2
Out[302]:
A ID DATETIME ORDER_FAILED diff
0 B-028 b76cd912ff 2014-10-08 13:43:27 True NaN
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False NaN
2 B-076 1a682034f8 2014-10-08 14:29:01 False NaN
3 B-023 b76cd912ff 2014-10-08 18:39:34 True 17767.0
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True NaN
5 B-025 b76cd912ff 2014-10-08 18:42:02 True 148.0
6 B-026 b76cd912ff 2014-10-08 18:42:41 False NaN
7 B-033 b76cd912ff 2014-10-08 18:44:30 True 148.0
8 B-032 b76cd912ff 2014-10-08 18:46:00 True 90.0
9 B-037 b76cd912ff 2014-10-08 18:52:15 True 375.0
10 B-046 db959faf02 2014-10-08 18:59:59 False NaN
11 B-053 b76cd912ff 2014-10-08 19:17:48 True 1533.0
12 B-065 b76cd912ff 2014-10-08 19:21:38 False NaN

然而,不幸的是,这是不正确的。订单 7 应该有 diff == NaN 因为这是一系列失败订单中的第一个订单,紧随该用户成功下单(即订单 6).

我意识到我计算上面的 diff 的方法是错误的,我没能找到在每次成功下单后“重置”计数器的方法。

期望的正确结果是:

In [303]: df_2
Out[303]:
A ID DATETIME ORDER_FAILED diff
0 B-028 b76cd912ff 2014-10-08 13:43:27 True NaN
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False NaN
2 B-076 1a682034f8 2014-10-08 14:29:01 False NaN
3 B-023 b76cd912ff 2014-10-08 18:39:34 True 17767.0
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True NaN
5 B-025 b76cd912ff 2014-10-08 18:42:02 True 148.0
6 B-026 b76cd912ff 2014-10-08 18:42:41 False NaN ## <- successful order
7 B-033 b76cd912ff 2014-10-08 18:44:30 True NaN ## <- since this is the first failed order in this sequence of failed orders
8 B-032 b76cd912ff 2014-10-08 18:46:00 True 90.0
9 B-037 b76cd912ff 2014-10-08 18:52:15 True 375.0
10 B-046 db959faf02 2014-10-08 18:59:59 False NaN
11 B-053 b76cd912ff 2014-10-08 19:17:48 True 1533.0
12 B-065 b76cd912ff 2014-10-08 19:21:38 False NaN

在这一点之后,我会用这样的东西标记 diff > 300 的订单:

>> df_2.ix[df_2['diff'] > 300, 'remove_flag'] = 1
>> df_2.groupby('ID')['remove_flag'].shift(-1) ## <- adjust flag to mark the previous order in the sequence
>> df_2 = df_2[df_2.remove_flag != 1]

这意味着,最终应该保留或丢弃的订单如下所示:

>> df_2 
A ID DATETIME ORDER_FAILED diff
0 B-028 b76cd912ff 2014-10-08 13:43:27 True NaN ## STAYS - Failed, but gap to next failed by same user is greater than 5 minutes
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False NaN ## STAYS - successful order
2 B-076 1a682034f8 2014-10-08 14:29:01 False NaN ## STAYS - successful order
3 B-023 b76cd912ff 2014-10-08 18:39:34 True 17767.0 ## DISCARD - The next failed order by the same user is only 148 seconds away (less than 5 minutes)
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True NaN ## STAYS - successful order
5 B-025 b76cd912ff 2014-10-08 18:42:02 True 148.0 ## STAYS - last in this sequence of failed orders by this user
6 B-026 b76cd912ff 2014-10-08 18:42:41 False NaN ## STAYS - successful order
7 B-033 b76cd912ff 2014-10-08 18:44:30 True NaN ## DISCARD - The next failed order by the same user is only 90 seconds away (less than 5 minutes)
8 B-032 b76cd912ff 2014-10-08 18:46:00 True 90.0 ## STAYS - next failed order by the same user is more than 5 minutes away
9 B-037 b76cd912ff 2014-10-08 18:52:15 True 375.0 ## STAYS - More than 5 minutes away from previous failed order by the same user
10 B-046 db959faf02 2014-10-08 18:59:59 False NaN ## STAYS - Successful order
11 B-053 b76cd912ff 2014-10-08 19:17:48 True 1533.0 ## STAYS - too long since last failed order by this same user
12 B-065 b76cd912ff 2014-10-08 19:21:38 False NaN ## STAYS - Successful order

非常感谢任何帮助,谢谢!

最佳答案

我将从按 ID 和 DATETIME(升序)排序开始:

df1 = df.sort_values(by = ['ID','DATETIME'])

现在,如果我没理解错的话,我们需要移除所有满足以下条件(“下一个”我理解为“在下一行”)合取的订单:

  • 订单失败

  • 下单失败

  • 下单与下单的时间差最多300s

  • (另外)ID 与下一个 ID 相同(否则它是最后一个订单)

我的想法很简单:添加适当的列,使每一行都包含评估这些条件所需的所有数据。

这个添加了“下一个 ID”和“下一个订单”字段:

df1[['Next_ID','Next_ORDER_FAILED']] = df1[['ID','ORDER_FAILED']].shift(-1)

而这个负责下一个订单的时间差:

df1['diff'] = -df1['DATETIME'].diff(-1).dt.total_seconds()

(与 period=-1 的相关差异将为负,因此为负号)。

我相信剩下的已经很简单了。

更新:顺便说一下,即使不向数据框添加新列,我们也可以创建一个 bool 掩码:

mask = (df1['ORDER_FAILED'] == True) and (df1['ORDER_FAILED'].shift(-1) == True) and ...

更新

实际上没有必要按 ID 排序,如果正确使用 groupby(),整个解决方案实际上会更清晰一些。根据上述建议,最后是这样完成的。

In [478]: df_3
Out[478]:
A ID DATETIME ORDER_FAILED
0 B-028 b76cd912ff 2014-10-08 13:43:27 True
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False
2 B-076 1a682034f8 2014-10-08 14:29:01 False
3 B-023 b76cd912ff 2014-10-08 18:39:34 True
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True
5 B-025 b76cd912ff 2014-10-08 18:42:02 True
6 B-026 b76cd912ff 2014-10-08 18:42:41 False
7 B-033 b76cd912ff 2014-10-08 18:44:30 True
8 B-032 b76cd912ff 2014-10-08 18:46:00 True
9 B-037 b76cd912ff 2014-10-08 18:52:15 True
10 B-046 db959faf02 2014-10-08 18:59:59 False
11 B-053 b76cd912ff 2014-10-08 19:17:48 True
12 B-065 b76cd912ff 2014-10-08 19:21:38 False

In [479]: df_3['NEXT_FAILED'] = df_3.sort_values(by='DATETIME').groupby('ID')['ORDER_FAILED'].shift(-1)

In [480]: df_3['SECONDS_TO_NEXT_ORDER'] = -df_3.sort_values(by='DATETIME').groupby('ID')['DATETIME'].diff(-1).dt.total_seconds()

In [481]: condition = (df_3.NEXT_FAILED == True) & (df_3.ORDER_FAILED == True) & (df_3.SECONDS_TO_NEXT_ORDER <= 300)

In [482]: df_3[~condition].drop(['NEXT_FAILED','SECONDS_TO_NEXT_ORDER'], axis=1)
Out[482]:
A ID DATETIME ORDER_FAILED
0 B-028 b76cd912ff 2014-10-08 13:43:27 True
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False
2 B-076 1a682034f8 2014-10-08 14:29:01 False
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True
5 B-025 b76cd912ff 2014-10-08 18:42:02 True
6 B-026 b76cd912ff 2014-10-08 18:42:41 False
8 B-032 b76cd912ff 2014-10-08 18:46:00 True
9 B-037 b76cd912ff 2014-10-08 18:52:15 True
10 B-046 db959faf02 2014-10-08 18:59:59 False
11 B-053 b76cd912ff 2014-10-08 19:17:48 True
12 B-065 b76cd912ff 2014-10-08 19:21:38 False

正确的订单 - 根据 OP 的描述 - 确实被删除了!

关于python - GroupBy - Datetime diff() 结合附加条件,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/37092874/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com