python - Create a column with an incrementing counter to identify sets of duplicates in Pandas

Reposted. Author: 行者123. Updated: 2023-11-28 19:03:58

I have a large df in which a subset of the columns holds identical values, dup_columns = ['id', 'subject','topic', 'lesson', 'time'], while others are unique, ['timestamps'].

   id  subj  topic  lesson  timestamp   time   dup_ind  dup_group  time_diff
1  1   math  add    a       timestamp1  45sec  True     1          timestamp1-timestamp2
2  1   math  add    a       timestamp2  45sec  True     1          timestamp1-timestamp2
3  1   math  add    a       timestamp2  30sec  False    NaN
4  1   math  add    a       timestamp3  15sec  False    NaN
5  1   math  add    b       timestamp1  0sec   True     2          timestamp1-timestamp4
6  1   math  add    b       timestamp4  0sec   True     2          timestamp1-timestamp4
7  1   math  add    b       timestamp1  45sec  True     3          timestamp1-timestamp2
8  1   math  add    b       timestamp2  45sec  True     3          timestamp1-timestamp2

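I have a column ['is_duplicate'] that identifies duplicates based on dup_columns. I need to create another column, ['dup_group'], that assigns a unique value (1, 2, 3, ...) to each set of duplicates. Ultimately I need this dup_group to compare the timestamp values within each duplicate group (I am using the .diff() method).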

Here is the code I wrote:

df2= df1.loc[df1['is_duplicated']==True]
def dup_counter():
    for name, group in df11.groupby(dup_columns):
        df[name, df['dupsetnew']]+=1
    return df['dupsetnew']

df11.groupby(dup_columns).apply(dup_counter)

Problem 1: the function gives me errors (I am new to Python and to programming).

To calculate the difference between the timestamps, I have the following code:

df['time_diff'] = df.loc[df.dup_indicator == 1,'event_time'].diff()

Question/Problem 2: is .diff the right method for what I need?

Best answer

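Here is one way. Note that I have changed df['timestamp'] to a series of integers to demonstrate the principle, but the same approach can be adapted to datetime values.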

The idea is to use pd.factorize on a list of tuples to identify the groups, then apply both a forward and a backward groupby.diff to get the desired result.

df['timestamp'] = [1, 2, 2, 3, 1, 4, 1, 2]

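# Label each distinct combination of the duplicate-defining columns (1, 2, 3, ...).
df['dup_group'] = pd.factorize(list(zip(df['id'], df['subj'], df['topic'],
                                        df['lesson'], df['time'])))[0] + 1

# Forward diff within each group: NaN on the first row of every group.
df['time_diff'] = df.groupby('dup_group')['timestamp'].transform(pd.Series.diff)

# Fill those NaNs with the negated backward diff so both rows of a pair match.
df['time_diff'] = df['time_diff'].fillna(-df.groupby('dup_group')['timestamp']
                                         .transform(pd.Series.diff, periods=-1))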

#    id  subj  topic  lesson  timestamp  time   dup_ind  dup_group  time_diff
# 1  1   math  add    a       1          45sec  True     1          1.0
# 2  1   math  add    a       2          45sec  True     1          1.0
# 3  1   math  add    a       2          30sec  False    2          NaN
# 4  1   math  add    a       3          15sec  False    3          NaN
# 5  1   math  add    b       1          0sec   True     4          3.0
# 6  1   math  add    b       4          0sec   True     4          3.0
# 7  1   math  add    b       1          45sec  True     5          1.0
# 8  1   math  add    b       2          45sec  True     5          1.0
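
In the output above, rows that are not duplicates also receive their own group number, whereas the question asks for NaN on those rows and consecutive numbers (1, 2, 3) on the duplicate sets only. One possible variation, swapping in groupby(...).ngroup() in place of pd.factorize, is sketched below. The sample frame is a hypothetical reconstruction of the question's table with integer timestamps, and computing dup_ind with DataFrame.duplicated is an assumption about how that flag was produced.

```python
import pandas as pd

# Hypothetical sample mirroring the question's table (integer timestamps for brevity).
df = pd.DataFrame({
    "id": [1] * 8,
    "subj": ["math"] * 8,
    "topic": ["add"] * 8,
    "lesson": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "time": ["45sec", "45sec", "30sec", "15sec", "0sec", "0sec", "45sec", "45sec"],
    "timestamp": [1, 2, 2, 3, 1, 4, 1, 2],
})
dup_columns = ["id", "subj", "topic", "lesson", "time"]

# Flag every row whose dup_columns combination occurs more than once.
df["dup_ind"] = df.duplicated(subset=dup_columns, keep=False)

# Number only the duplicated rows; sort=False keeps first-appearance order.
# Assigning back aligns on the index, so non-duplicated rows become NaN.
dups = df[df["dup_ind"]]
df["dup_group"] = dups.groupby(dup_columns, sort=False).ngroup() + 1
```

Because the non-duplicated rows are left as NaN, the column is stored as floats. A later groupby('dup_group') skips NaN keys by default, so those rows would also get NaN from a grouped diff.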

Source data

import pandas as pd
from numpy import nan

df = pd.DataFrame({'dup_group': {1: 1.0, 2: 1.0, 3: nan, 4: nan, 5: 2.0, 6: 2.0, 7: 3.0, 8: 3.0},
'dup_ind': {1: True, 2: True, 3: False, 4: False, 5: True, 6: True, 7: True, 8: True},
'id': {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
'lesson': {1: 'a', 2: 'a', 3: 'a', 4: 'a', 5: 'b', 6: 'b', 7: 'b', 8: 'b'},
'subj': {1: 'math', 2: 'math', 3: 'math', 4: 'math', 5: 'math', 6: 'math', 7: 'math', 8: 'math'},
'time': {1: '45sec', 2: '45sec', 3: '30sec', 4: '15sec', 5: '0sec', 6: '0sec', 7: '45sec', 8: '45sec'},
'time_diff': {1: 'timestamp1-timestamp2', 2: 'timestamp1-timestamp2', 3: nan, 4: nan, 5: 'timestamp1-timestamp4', 6: 'timestamp1-timestamp4', 7: 'timestamp1-timestamp2', 8: 'timestamp1-timestamp2'},
'timestamp': {1: 'timestamp1', 2: 'timestamp2', 3: 'timestamp2', 4: 'timestamp3', 5: 'timestamp1', 6: 'timestamp4', 7: 'timestamp1', 8: 'timestamp2'},
'topic': {1: 'add', 2: 'add', 3: 'add', 4: 'add', 5: 'add', 6: 'add', 7: 'add', 8: 'add'}})
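
The answer demonstrates the principle with integer timestamps and notes that it can be adapted to datetime values. A minimal sketch of that adaptation follows. The timestamps are hypothetical and dup_group is assumed to be filled in already. With a datetime column, the grouped diff yields Timedelta values and NaT takes the place of NaN.

```python
import pandas as pd

# Hypothetical timestamps; dup_group is assumed to be assigned already.
df = pd.DataFrame({
    "dup_group": [1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2018-03-09 10:00:00", "2018-03-09 10:00:45",
        "2018-03-09 11:00:00", "2018-03-09 11:03:00",
    ]),
})

grouped = df.groupby("dup_group")["timestamp"]

# Forward diff: NaT on the first row of each group, a Timedelta afterwards.
forward = grouped.diff()
# Backward diff (periods=-1) is negative for ascending times, so negate it.
backward = -grouped.diff(periods=-1)

# Fill the leading NaT of each group from the backward diff.
df["time_diff"] = forward.fillna(backward)
```

Here each pair of rows ends up with the same Timedelta, matching the integer example where both rows of a group show the same difference.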

A similar question to "python - Create a column with an incrementing counter to identify sets of duplicates in Pandas" can be found on Stack Overflow: https://stackoverflow.com/questions/49202126/
