
python - Flag rolling duplicates by previous values (year) in a grouping variable

Reposted. Author: 行者123. Updated: 2023-11-28 22:08:52

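I am trying to figure out whether any ID occurred in any earlier year (i.e., the Duplicate column in dfo). If so, I want to flag that row as a duplicate and include the year in which the ID first occurred (i.e., Year_Duplicate).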

I do have working code.

Objective: I want to learn a better (more 'pythonic') way to solve this problem. If there is a more concise way to do it, I'd appreciate any help. I'm not very familiar with all the features that numpy and pandas offer.

Sample input:

dfi.to_dict() = 
{'Year': {0: 2020,
1: 2020,
2: 2020,
3: 2021,
4: 2021,
5: 2021,
6: 2022,
7: 2022,
8: 2022},
'ID': {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5, 7: 1, 8: 4},
'$': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3}}

Sample output:

dfo.to_dict()
{'Year': {0: 2020,
1: 2020,
2: 2020,
3: 2021,
4: 2021,
5: 2021,
6: 2022,
7: 2022,
8: 2022},
'ID': {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5, 7: 1, 8: 4},
'$': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3},
'Duplicate': {0: False,
1: False,
2: False,
3: True,
4: False,
5: True,
6: False,
7: True,
8: True},
'Year_Duplicate': {0: nan,
1: nan,
2: nan,
3: 2020.0,
4: nan,
5: 2020.0,
6: nan,
7: 2020.0,
8: 2021.0}}

Working code:

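import pandas as pd
from numpy import nan as NA

# dfi and dfo are the dicts shown above (sample input and expected output)
dfi = pd.DataFrame.from_dict(dfi)
dfo = pd.DataFrame.from_dict(dfo)

df_process = dfi.copy()
# flag every occurrence of an ID after its first one
df_process['Duplicate'] = df_process['ID'].duplicated()

# row label of the earliest year for each ID (idxmin must be called)
indexes = df_process.groupby('ID')['Year'].idxmin()
df_min_year = df_process[['Year', 'ID']].loc[indexes]
df_min_year = df_min_year.rename(columns={"Year": "Year_Duplicate"})

# attach each ID's first year, then blank it on the first occurrence itself
df_process = pd.merge(df_process, df_min_year, on=['ID'], how='left')
df_process.loc[df_process['Year_Duplicate'] == df_process['Year'], 'Year_Duplicate'] = NA

dfo.equals(df_process)  # returns True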

I'm happy to answer any questions. Thank you for your help.


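Clarifications from the comments below:

  • $ is just a number representing sales. It can be ignored when checking for duplicates.
  • Year_Duplicate shows the first year in which the ID occurred. If the row is not a duplicate, Year_Duplicate is not needed and we leave it blank.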

Best answer

Use Series.duplicated and Series.where together with GroupBy.transform and GroupBy.first:

df['Year_Duplicated'] = df.groupby('ID')['Year'].transform('first').where(df['ID'].duplicated())
print(df)
   Year  ID  $  Year_Duplicated
0  2020   1  1              NaN
1  2020   2  1              NaN
2  2020   3  1              NaN
3  2021   1  2           2020.0
4  2021   4  2              NaN
5  2021   2  2           2020.0
6  2022   5  3              NaN
7  2022   1  3           2020.0
8  2022   4  3           2021.0
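
The question asks for a boolean Duplicate column as well as the year column. The following is a minimal sketch of how the same two calls could produce both, assuming the sample data from the question and assuming the rows are already ordered by Year (so that the first row of each ID holds its earliest year). The variable name is_dup is only illustrative.

```python
import pandas as pd

# Sample input from the question.
df = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022],
    "ID":   [1, 2, 3, 1, 4, 2, 5, 1, 4],
    "$":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
})

# True for every occurrence of an ID after its first one.
is_dup = df["ID"].duplicated()

df["Duplicate"] = is_dup
# First Year of each ID, kept only on duplicate rows (NaN elsewhere).
df["Year_Duplicate"] = df.groupby("ID")["Year"].transform("first").where(is_dup)

print(df)
```

Computing the mask once and reusing it keeps the flag and the year column consistent with each other. Year_Duplicate comes out as float because the non-duplicate rows hold NaN.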

Details:

print(df.groupby('ID')['Year'].transform('first'))
0    2020
1    2020
2    2020
3    2020
4    2021
5    2020
6    2022
7    2020
8    2021
Name: Year, dtype: int64
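
Note that transform('first') and duplicated() both depend on row order: they treat whichever row comes first in the frame as the original. If the data might not be sorted by Year, one possible variant is to compare each row's Year against the minimum Year of its ID instead. This is only a sketch under the assumption that "duplicate" means "the ID appeared in a strictly earlier year", as the question describes; the shuffled rows below are the question's data in a different order, and the name earliest is illustrative.

```python
import pandas as pd

# Same (Year, ID) pairs as in the question, deliberately shuffled.
df = pd.DataFrame({
    "Year": [2022, 2021, 2020, 2022, 2020, 2021, 2020, 2022, 2021],
    "ID":   [1, 4, 1, 4, 2, 1, 3, 5, 2],
})

# Earliest year per ID, broadcast back to every row of that ID.
earliest = df.groupby("ID")["Year"].transform("min")

# A row is a duplicate if its ID was already seen in a strictly earlier year.
df["Duplicate"] = df["Year"] > earliest
df["Year_Duplicate"] = earliest.where(df["Duplicate"])

print(df)
```

Unlike the duplicated() approach, this variant does not flag a second occurrence of an ID within the same year, so the two are only equivalent when each ID appears at most once per year.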

Regarding python - Flag rolling duplicates by previous values (year) in a grouping variable, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58062128/
