
python - Pandas: grouping by repeated dates

Reposted. Author: 行者123. Updated: 2023-12-04 15:02:29

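I am analysing a dataset with pandas and trying to group Broadway shows together based on a date condition. Here is a simplified dataset for the problem: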

+-------------------+-------------+---------+-------+
| Show | Week_ending | Theatre | gross |
+-------------------+-------------+---------+-------+
| Mamma Mia | 05/01/2020 | T1 | 100 |
| Mamma Mia | 12/01/2020 | T1 | 200 |
| Mamma Mia | 19/01/2020 | T1 | 150 |
| Shrek The Musical | 19/01/2020 | T2 | 100 |
| Mamma Mia | 11/08/2019 | T3 | 100 |
| Mamma Mia | 18/08/2019 | T3 | 100 |
| Mamma Mia | 27/12/2009 | T1 | 100 |
| Mamma Mia | 03/10/2010 | T1 | 100 |
| Mamma Mia | 10/01/2010 | T1 | 100 |
+-------------------+-------------+---------+-------+

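What I want to do is group the shows together so they can be compared further. Shows with the same name that run in different theatres should be kept separate, so I use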

df2 = df.groupby(['Show', 'Theatre'])['gross'].mean()

However, this groups the items as follows:

+-------------------+---------+------------+
| Show | Theatre | mean_gross |
+-------------------+---------+------------+
| Mamma Mia | T1 | 100 |
| Shrek The Musical | T2 | 100 |
| Mamma Mia | T3 | 100 |
+-------------------+---------+------------+

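This is a problem because it puts the two Mamma Mia runs that took place in the same theatre (T1) into one group, even though they happened 10 years apart. The desired result would look something like this (the "(2009)" is not strictly required, as long as the runs are separate):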

+-------------------+---------+------------+
| Show | Theatre | mean_gross |
+-------------------+---------+------------+
| Mamma Mia | T1 | 100 |
| Shrek The Musical | T2 | 100 |
| Mamma Mia | T3 | 100 |
| Mamma Mia (2009) | T1 | 100 |
+-------------------+---------+------------+

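It is essential that if a show is in the same theatre but there is a gap of a week with nothing recorded, the two parts are treated as separate shows. Separating by year alone will not work, because a show may run from November to March, and splitting it into two separate shows would make no sense.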
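To illustrate the last point, here is a minimal sketch using a hypothetical uninterrupted run (the dates below are made up and are not part of the dataset above). Grouping on the calendar year cuts the run in two even though no week is missing:

```python
import pandas as pd

# Hypothetical run with one record per week from late November 2019 to early March 2020
weeks = pd.date_range("2019-11-24", "2020-03-01", freq="7D")
run = pd.DataFrame({"Show": "Mamma Mia", "Theatre": "T1",
                    "Week_ending": weeks, "gross": 100})

# Largest gap between consecutive weeks: the run is never interrupted
max_gap = run["Week_ending"].diff().dt.days.max()

# Grouping by calendar year still splits the single run into two groups
by_year = run.groupby(["Show", "Theatre", run["Week_ending"].dt.year])["gross"].mean()

print(max_gap)
print(by_year)
```

So whatever key is used for grouping has to depend on the gap between consecutive weeks, not on the year.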

Best answer

Basically, what you need is to create another column that flags every place where the date changes by more than 7 days.

import pandas as pd

df = pd.DataFrame({
    'Show': ['Mamma Mia', 'Mamma Mia', 'Mamma Mia', 'Shrek The Musical',
             'Mamma Mia', 'Mamma Mia', 'Mamma Mia', 'Mamma Mia', 'Mamma Mia'],
    'Week_ending': ['05/01/2020', '12/01/2020', '19/01/2020', '19/01/2020', '11/08/2019',
                    '18/08/2019', '27/12/2009', '03/10/2010', '10/01/2010'],
    'Theatre': ['T1', 'T1', 'T1', 'T2', 'T3', 'T3', 'T1', 'T1', 'T1'],
    'gross': [100, 200, 150, 100, 100, 100, 100, 100, 100],
})

# Convert `Week_ending` to datetime (the dates are day/month/year)
df['Week_ending'] = pd.to_datetime(df['Week_ending'], format="%d/%m/%Y")

# Sort so that the weeks of each show/theatre pair are in chronological order
df = df.sort_values(['Show', 'Theatre', 'Week_ending'])

# Difference in days between two consecutive dates within the same group
df['days_diff'] = df.groupby(['Show', 'Theatre'])['Week_ending'].diff().dt.days.fillna(0)
# Flag the rows where the difference is more than 7 days
df['days_diff'] = df['days_diff'] > 7

# Create a key column that increments by 1 every time the gap is more than 7 days
df['key'] = df.groupby(['Show', 'Theatre'])['days_diff'].cumsum()

# Resulting dataframe: one row per show, theatre and uninterrupted run
df_final = (
    df.groupby(['Show', 'Theatre', 'key'])[['gross']].mean()
    .reset_index()
    .drop(columns=['key'])
    .rename(columns={'gross': 'mean_gross'})
)

df_final

+-------------------+---------+------------+
| Show | Theatre | mean_gross |
+-------------------+---------+------------+
| Mamma Mia | T1 | 100 |
| Mamma Mia | T1 | 100 |
| Mamma Mia | T1 | 100 |
| Mamma Mia | T1 | 150 |
| Mamma Mia | T3 | 100 |
| Shrek The Musical | T2 | 100 |
+-------------------+---------+------------+
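The result above no longer shows which T1 row belongs to which run. As a possible extension (a sketch, not part of the original answer), each run can be labelled with the date of its first week, in the spirit of the "Mamma Mia (2009)" row the question asks for. The helper name `label_runs`, its `max_gap_days` parameter and the `run` and `run_label` columns are hypothetical names introduced here for illustration, and the function assumes `Week_ending` holds day/month/year strings as in the sample data. The full start date is used instead of the year alone because two runs in the sample data start in 2010.

```python
import pandas as pd

def label_runs(df, max_gap_days=7):
    """Return a sorted copy of df with a 'run_label' column, one label per uninterrupted run."""
    out = df.copy()
    out["Week_ending"] = pd.to_datetime(out["Week_ending"], format="%d/%m/%Y")
    out = out.sort_values(["Show", "Theatre", "Week_ending"])
    # True where the gap to the previous week of the same show/theatre exceeds the limit
    new_run = out.groupby(["Show", "Theatre"])["Week_ending"].diff().dt.days.fillna(0) > max_gap_days
    # Running count of such gaps per show/theatre gives a run number
    out["run"] = new_run.astype(int).groupby([out["Show"], out["Theatre"]]).cumsum()
    # First week of each run, used to build a readable and unique label
    start = out.groupby(["Show", "Theatre", "run"])["Week_ending"].transform("min")
    out["run_label"] = out["Show"] + " (" + start.dt.strftime("%Y-%m-%d") + ")"
    return out

df = pd.DataFrame({
    "Show": ["Mamma Mia", "Mamma Mia", "Mamma Mia", "Shrek The Musical",
             "Mamma Mia", "Mamma Mia", "Mamma Mia", "Mamma Mia", "Mamma Mia"],
    "Week_ending": ["05/01/2020", "12/01/2020", "19/01/2020", "19/01/2020", "11/08/2019",
                    "18/08/2019", "27/12/2009", "03/10/2010", "10/01/2010"],
    "Theatre": ["T1", "T1", "T1", "T2", "T3", "T3", "T1", "T1", "T1"],
    "gross": [100, 200, 150, 100, 100, 100, 100, 100, 100],
})

labelled = label_runs(df)
result = labelled.groupby(["run_label", "Theatre"], as_index=False)["gross"].mean()
print(result)
```

Raising `max_gap_days` to 14, for example, would tolerate a single missing week inside a run, which may or may not be what the analysis needs.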

Regarding "python - Pandas: grouping by repeated dates", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66732543/
