gpt4 book ai didi

python - 如何在填充()期间按列分组显示而不是使用 Pandas 进行聚合?

转载 作者:行者123 更新时间:2023-12-01 01:40:15 24 4
gpt4 key购买 nike

这不是重复的。我已经提到了这个 post_1post_2

我的问题是不同的,而不是关于 agg功能。也是按列分组显示期间ffill操作 .尽管代码运行良好,但只需共享完整代码即可让您有所了解。 问题出在注释行 .注意下面的那一行。

我有一个如下所示的数据框

df = pd.DataFrame({
'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
'time_1' :['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00','2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-06 13:39:00','2173-07-08 11:30:00','2173-04-08 16:00:00','2173-04-09 22:00:00','2173-04-11 04:00:00','2173- 04-13 04:30:00','2173-04-14 08:00:00'],
'val' :[5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month

这段代码在论坛的 Jezrael 的帮助下所做的是 add missing dates基于阈值。唯一的问题是,我没有看到 grouped by column during output
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')

df1 = (df.set_index('date')
.groupby('subject_id')
.resample('d')
.last()
.index
.to_frame(index=False))
df2 = df1.merge(df, how='left')

thresh = 5
mask = df2['day'].notna()
s = mask.cumsum().mask(mask)
df2['count'] = s.map(s.value_counts())

df2 = df2[(df2['count'] < thresh) | (df2['count'].isna())]

df2 = df2.groupby(df2['subject_id']).ffill() # problem is here #here is the problem

dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)

如上面的代码所示,我尝试了以下方法
df2 = df2.groupby(df2['subject_id']).ffill()  # doesn't help
df2 = df2.groupby(df2['subject_id']).ffill().reset_index() # doesn't help
df2 = df2.groupby('subject_id',as_index=False).ffill() # doesn't help

没有主题 ID 的错误输出

enter image description here

我希望我的输出有 subject_id列也是

最佳答案

这里有 2 种可能的解决方案 - 在 groupby 之后指定列表中的所有列并分配回:

cols = df2.columns.difference(['subject_id'])
df2[cols] = df2.groupby('subject_id')[cols].ffill() # problem is here #here is the problem

或者通过 subject_id 创建索引列和按索引分组:
#newer pandas versions
df2 = df2.set_index('subject_id').groupby('subject_id').ffill().reset_index()

#oldier pandas versions
df2 = df2.set_index('subject_id').groupby(level=0).ffill().reset_index()
dates = df2['time_1'].dt.normalize() 
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id date time_1 val day month count
0 1 2173-04-03 2173-04-03 12:35:00 5 3 4.0 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5 3 4.0 NaN
2 1 2173-04-04 2173-04-04 12:50:00 5 4 4.0 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5 5 4.0 1.0
32 1 2173-05-04 2173-05-04 13:14:00 5 4 5.0 1.0
33 1 2173-05-05 2173-05-05 13:37:00 1 5 5.0 1.0
95 1 2173-07-06 2173-07-06 13:39:00 6 6 7.0 1.0
96 1 2173-07-07 2173-07-07 13:39:00 6 7 7.0 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5 8 7.0 1.0
98 2 2173-04-08 2173-04-08 16:00:00 5 8 4.0 NaN
99 2 2173-04-09 2173-04-09 22:00:00 8 9 4.0 NaN
100 2 2173-04-10 2173-04-10 22:00:00 8 10 4.0 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3 11 4.0 1.0
102 2 2173-04-12 2173-04-12 04:00:00 3 12 4.0 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4 13 4.0 1.0
104 2 2173-04-14 2173-04-14 08:00:00 6 14 4.0 1.0

关于python - 如何在填充()期间按列分组显示而不是使用 Pandas 进行聚合?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/58778768/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com