
python - How do I compare a row's value in one column against all other rows in a different column within a group?

Reposted. Author: 太空宇宙. Last updated: 2023-11-03 15:32:15

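I have a data frame with the following columns: user_id, product_id, created_at and removed_at. I want to add a bool column "is_switch" that is True if, for a given user, the created_at timestamp of a row is within a timedelta (say 1 second) of the removed_at of any other row in that user's group. How can I do this without iterating over every row, or is iterating the appropriate way to do it?

I am trying to write a custom function to use with .apply that will run on each user group, but I am not sure how to compare a row with all the other rows in one go.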

# Code to create the sample data frame.
# Each "2" variable is exactly one second earlier than its base timestamp.

import datetime

import pandas as pd

a = datetime.datetime.now()
a2 = a - datetime.timedelta(seconds=1)
b = datetime.datetime.now() - datetime.timedelta(days=4)
b2 = b - datetime.timedelta(seconds=1)
c = datetime.datetime.now() - datetime.timedelta(days=40)
c2 = c - datetime.timedelta(seconds=1)
d = datetime.datetime.now() - datetime.timedelta(days=30)
d2 = d - datetime.timedelta(seconds=1)
e = datetime.datetime.now() - datetime.timedelta(days=60)
e2 = e - datetime.timedelta(seconds=1)
f = datetime.datetime.now() - datetime.timedelta(days=100)
g = datetime.datetime.now() - datetime.timedelta(days=99)

# pd.NaT (not the string 'NaT') keeps removed_at as a datetime column.
df = pd.DataFrame(
    {"user_id": [0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
     "product_id": [100, 101, 102, 101, 102, 104, 105, 106, 107, 105, 106, 107],
     "created_at": [a, a, b, c, d, c, f, f, e2, f, f, d],
     "removed_at": [pd.NaT, b2, pd.NaT, d2, pd.NaT, pd.NaT, e, g, pd.NaT, e2, g, b]},
    index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

print(df)

This produces:


    user_id  product_id                  created_at                  removed_at
 0        0         100  2019-08-04 09:15:05.200981                         NaT
 1        1         101  2019-08-04 09:15:05.200981  2019-07-31 09:15:04.201063
 2        1         102  2019-07-31 09:15:05.201063                         NaT
 3        2         101  2019-06-25 09:15:05.201121  2019-07-05 09:15:04.201179
 4        2         102  2019-07-05 09:15:05.201179                         NaT
 5        2         104  2019-06-25 09:15:05.201121                         NaT
 6        3         105  2019-04-26 09:15:05.201290  2019-06-05 09:15:05.201235
 7        3         106  2019-04-26 09:15:05.201290  2019-04-27 09:15:05.201324
 8        3         107  2019-06-05 09:15:04.201235                         NaT
 9        4         105  2019-04-26 09:15:05.201290  2019-06-05 09:15:04.201235
10        4         106  2019-04-26 09:15:05.201290  2019-04-27 09:15:05.201324
11        4         107  2019-07-05 09:15:05.201179  2019-07-31 09:15:05.201063

So this is what I have so far:

group_by_user = df.groupby('user_id')

def calculate_is_switch(grp):
    # What goes here? How can I do it without iterating over each row?
    ...

# group_by_user.apply(calculate_is_switch)

I want to add the "is_switch" column so that the output looks like this:

    user_id  product_id                  created_at                  removed_at  \
 0        0         100  2019-08-04 09:15:05.200981                         NaT
 1        1         101  2019-08-04 09:15:05.200981  2019-07-31 09:15:04.201063
 2        1         102  2019-07-31 09:15:05.201063                         NaT
 3        2         101  2019-06-25 09:15:05.201121  2019-07-05 09:15:04.201179
 4        2         102  2019-07-05 09:15:05.201179                         NaT
 5        2         104  2019-06-25 09:15:05.201121                         NaT
 6        3         105  2019-04-26 09:15:05.201290  2019-06-05 09:15:05.201235
 7        3         106  2019-04-26 09:15:05.201290  2019-04-27 09:15:05.201324
 8        3         107  2019-06-05 09:15:04.201235                         NaT
 9        4         105  2019-04-26 09:15:05.201290  2019-06-05 09:15:04.201235
10        4         106  2019-04-26 09:15:05.201290  2019-04-27 09:15:05.201324
11        4         107  2019-07-05 09:15:05.201179  2019-07-31 09:15:05.201063

    is_switch
 0      False
 1      False
 2       True
 3      False
 4       True
 5      False
 6      False
 7      False
 8       True
 9      False
10      False
11      False

Best answer

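Use GroupBy.apply with a custom function. First replace the missing values in removed_at with a default datetime such as Timestamp.min. Then, within each group, compare the two columns using broadcasting: subtract every removed_at value from every created_at value, take the absolute value of the differences, test them against 1 second, and reduce with any so that a row is True when at least one comparison is True.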
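To see the broadcasting step in isolation, here is a minimal sketch on a toy group of three rows. The timestamps are made-up illustration values, not the question's data, and the names created, removed, diff and within are chosen only for this example.

```python
import numpy as np

# Toy data (hypothetical values): three rows belonging to one user group.
created = np.array(
    ["2019-07-31T09:15:05", "2019-07-05T09:15:05", "2019-06-25T09:15:05"],
    dtype="datetime64[s]",
)
removed = np.array(
    ["NaT", "2019-07-31T09:15:04", "2019-07-05T09:15:04"],
    dtype="datetime64[s]",
)

# removed[:, None] has shape (3, 1); subtracting it from created (shape (3,))
# broadcasts to a (3, 3) matrix with diff[i, j] = created[j] - removed[i].
diff = created - removed[:, None]

# A NaT difference compares as False, so missing removed_at values never match.
within = np.abs(diff) <= np.timedelta64(1, "s")

# Reduce over the removed_at axis: one flag per created_at value.
flags = within.any(axis=0)
print(flags)
```

The first two created_at values each sit one second after some removed_at value, so they are flagged; the third is not. The answer's function f applies the same pattern to each user group.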

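import numpy as np
import pandas as pd

val = pd.Timedelta(1, unit='s')

def f(x):
    # Pairwise differences: y[i, j] = created_at[j] - removed_at[i]
    y = x['created_at'].values - x['removed_at'].values[:, None]
    # For datetime64[ns] columns the int64 view is in nanoseconds, like val.value.
    # A created_at value is flagged if any removed_at lies within `val` of it.
    y = np.any((np.abs(y).astype(np.int64) <= val.value), axis=0)

    return pd.Series(y, index=x.index)

df['is_switch'] = (df.assign(removed_at=df['removed_at'].fillna(pd.Timestamp.min))
                     .groupby('user_id')
                     .apply(f)
                     .reset_index(level=0, drop=True))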

print(df)
    user_id  product_id                  created_at                  removed_at  \
 0        0         100  2019-08-04 16:22:39.309093                         NaT
 1        1         101  2019-08-04 16:22:39.309093  2019-07-31 16:22:38.309093
 2        1         102  2019-07-31 16:22:39.309093                         NaT
 3        2         101  2019-06-25 16:22:39.309093  2019-07-05 16:22:38.309093
 4        2         102  2019-07-05 16:22:39.309093                         NaT
 5        2         104  2019-06-25 16:22:39.309093                         NaT
 6        3         105  2019-04-26 16:22:39.309093  2019-06-05 16:22:39.309093
 7        3         106  2019-04-26 16:22:39.309093  2019-04-27 16:22:39.309093
 8        3         107  2019-06-05 16:22:38.309093                         NaT
 9        4         105  2019-04-26 16:22:39.309093  2019-06-05 16:22:38.309093
10        4         106  2019-04-26 16:22:39.309093  2019-04-27 16:22:39.309093
11        4         107  2019-07-05 16:22:39.309093  2019-07-31 16:22:39.309093

    is_switch
 0      False
 1      False
 2       True
 3      False
 4       True
 5      False
 6      False
 7      False
 8       True
 9      False
10      False
11      False
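
As a possible variant (not the answer's method), the sketch below skips the Timestamp.min sentinel and the int64 cast and compares timedeltas directly, relying on NaT comparing as False. Two reasons one might prefer this, both stated as assumptions: the gap between a 2019 date and Timestamp.min appears to be larger than timedelta64[ns] can represent, so the sentinel subtraction may overflow, and the question asks about "any other row", so this version also masks out a row's comparison with itself. The names is_switch_per_group, tolerance and demo are hypothetical, and the demo frame uses made-up values.

```python
import numpy as np
import pandas as pd


def is_switch_per_group(grp, tolerance=pd.Timedelta(seconds=1)):
    """Flag rows whose created_at is within `tolerance` of ANOTHER row's removed_at."""
    created = grp["created_at"].to_numpy()
    removed = grp["removed_at"].to_numpy()
    # diff[i, j] = |created[j] - removed[i]|; NaT stays NaT.
    diff = np.abs(created - removed[:, None])
    # Comparisons involving NaT evaluate to False, so no fillna is needed.
    within = diff <= tolerance.to_timedelta64()
    # Ignore each row's comparison with its own removed_at.
    np.fill_diagonal(within, False)
    return pd.Series(within.any(axis=0), index=grp.index)


# Small hypothetical frame: user 1 removes one product and adds another
# a second later; user 2 has no removals at all.
base = pd.Timestamp("2019-07-31 09:15:05")
demo = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "created_at": [base + pd.Timedelta(days=4), base, base, base],
    "removed_at": [base - pd.Timedelta(seconds=1), pd.NaT, pd.NaT, pd.NaT],
})

# Iterating over groups (not rows) and concatenating keeps the original index.
demo["is_switch"] = pd.concat(
    [is_switch_per_group(grp) for _, grp in demo.groupby("user_id")]
).sort_index()
print(demo)
```

Only the second row of user 1 is flagged in this demo. The loop runs once per user rather than once per row, and the pairwise matrix for a group of n rows has n × n entries, so very large groups may need a different approach such as a sorted merge.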

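Regarding "python - How do I compare a row's value in one column against all other rows in a different column within a group?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57346531/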
