
A Pythonic way to fill rows with a date range

Reposted. Author: 太空狗. Updated: 2023-10-29 21:27:45

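I am working on a problem that requires me to fill in rows for missing dates (that is, the dates between two dates held in columns of a pandas DataFrame). See the example below. I am using pandas for my current approach (described below).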

Sample input data (about 25,000 rows):

A  | B  | C  | Date1    | Date2
a1 | b1 | c1 | 1Jan1990 | 15Aug1990 <- this row should be repeated for all dates between the two dates
.......................
a3 | b3 | c3 | 11May1986 | 11May1986 <- this row should NOT be repeated. Just 1 entry since both dates are same.
.......................
a5 | b5 | c5 | 1Dec1984 | 31Dec2017 <- this row should be repeated for all dates between the two dates
..........................
..........................

Expected output:

A  | B  | C  | Month    | Year
a1 | b1 | c1 | 1 | 1990 <- Since date 1 column for this row was Jan 1990
a1 | b1 | c1 | 2 | 1990
.......................
.......................
a1 | b1 | c1 | 7 | 1990
a1 | b1 | c1 | 8 | 1990 <- Since date 2 column for this row was Aug 1990
..........................
a3 | b3 | c3 | 5 | 1986 <- only 1 row since two dates in input dataframe were same for this row.
...........................
a5 | b5 | c5 | 12 | 1984 <- since date 1 column for this row was Dec 1984
a5 | b5 | c5 | 1 | 1985
..........................
..........................
a5 | b5 | c5 | 11 | 2017
a5 | b5 | c5 | 12 | 2017 <- Since date 2 column for this row was Dec 2017

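I know a more conventional way to do this (my current approach):

  • Iterate over each row.
  • Get the difference in days between the two date columns.
  • If the dates in the two columns are the same, include just one row for that month and year in the output DataFrame.
  • If the dates differ (diff > 0), collect every (month, year) combination covered by the days in that range and append them to a new DataFrame.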
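
For reference, the steps above might look roughly like the following. This is only a sketch of the iterative approach as described, not the asker's actual code: the two-row sample frame, the `%d%b%Y` date format, and the names `seen`, `rows`, and `naive` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the example input.
df = pd.DataFrame(
    [("a1", "b1", "c1", "1Jan1990", "15Aug1990"),
     ("a3", "b3", "c3", "11May1986", "11May1986")],
    columns=["A", "B", "C", "Date1", "Date2"],
)
# Assumed date format: day, abbreviated month name, four-digit year.
df["Date1"] = pd.to_datetime(df["Date1"], format="%d%b%Y")
df["Date2"] = pd.to_datetime(df["Date2"], format="%d%b%Y")

rows = []
for row in df.itertuples(index=False):
    seen = set()
    # Walk every day in the range and keep each (year, month) pair once.
    for day in pd.date_range(row.Date1, row.Date2, freq="D"):
        key = (day.year, day.month)
        if key not in seen:
            seen.add(key)
            rows.append((row.A, row.B, row.C, day.month, day.year))

naive = pd.DataFrame(rows, columns=["A", "B", "C", "Month", "Year"])
print(naive)
```

Walking the range day by day does far more work than the number of output rows requires, which is why this approach scales poorly on long date ranges.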

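Since the input has roughly 25,000 rows, I expect the output to be very large, so I am looking for a more Pythonic way to do this (if one exists and is faster than the iterative approach)!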

Best answer

In my opinion, the best tool to use here is a PeriodIndex (to generate the months and years between the dates).
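
As a small illustration of what a monthly PeriodIndex gives you, here is a sketch using `pd.period_range`. The ISO-formatted dates and the names `months` and `single` are chosen only for this example.

```python
import pandas as pd

# One monthly period per calendar month, with both endpoints included.
months = pd.period_range(start="1990-01-01", end="1990-08-15", freq="M")
print(months)
print(list(months.month))
print(list(months.year))

# Identical start and end dates collapse to a single period.
single = pd.period_range(start="1986-05-11", end="1986-05-11", freq="M")
print(single)
```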

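However, a PeriodIndex can only be built for one row at a time. So if we are going to use PeriodIndex, each row has to be processed separately. Unfortunately, that means looping over the DataFrame: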

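import pandas as pd

df = pd.DataFrame([('a1','b1','c1','1Jan1990','15Aug1990'),
                   ('a3','b3','c3','11May1986','11May1986'),
                   ('a5','b5','c5','1Dec1984','31Dec2017')],
                  columns=['A','B','C','Date1','Date2'])
# Parse the day-month-year strings explicitly so the range endpoints are unambiguous
for col in ['Date1', 'Date2']:
    df[col] = pd.to_datetime(df[col], format='%d%b%Y')

result = []
for tup in df.itertuples():
    # pd.period_range stands in for the PeriodIndex(start=..., end=...) constructor
    # form, which newer pandas versions no longer accept
    index = pd.period_range(start=tup.Date1, end=tup.Date2, freq='M')
    # The single (A, B, C) row is repeated for every month; columns are named 0, 1, 2
    new_df = pd.DataFrame([(tup.A, tup.B, tup.C)], index=index)
    new_df['Month'] = new_df.index.month
    new_df['Year'] = new_df.index.year
    result.append(new_df)
# Stack the per-row frames into one long DataFrame
result = pd.concat(result, axis=0)
print(result)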

yields

          0   1   2  Month  Year
1990-01 a1 b1 c1 1 1990 <--- Beginning of row 1
1990-02 a1 b1 c1 2 1990
1990-03 a1 b1 c1 3 1990
1990-04 a1 b1 c1 4 1990
1990-05 a1 b1 c1 5 1990
1990-06 a1 b1 c1 6 1990
1990-07 a1 b1 c1 7 1990
1990-08 a1 b1 c1 8 1990 <--- End of row 1
1986-05 a3 b3 c3 5 1986 <--- Beginning and End of row 2
1984-12 a5 b5 c5 12 1984 <--- Beginning row 3
1985-01 a5 b5 c5 1 1985
1985-02 a5 b5 c5 2 1985
1985-03 a5 b5 c5 3 1985
1985-04 a5 b5 c5 4 1985
... .. .. .. ... ...
2017-09 a5 b5 c5 9 2017
2017-10 a5 b5 c5 10 2017
2017-11 a5 b5 c5 11 2017
2017-12 a5 b5 c5 12 2017 <--- End of row 3

[406 rows x 5 columns]
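
As an aside, for larger inputs one possible way to avoid the per-row loop is to do the month arithmetic on whole columns. The following is a sketch of a different technique, not part of the answer above. It assumes the dates follow the `%d%b%Y` format and that Date2 is never earlier than Date1; the names `first`, `last`, `counts`, `serial`, and `out` are illustrative. The result has a plain integer index rather than a PeriodIndex.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [("a1", "b1", "c1", "1Jan1990", "15Aug1990"),
     ("a3", "b3", "c3", "11May1986", "11May1986"),
     ("a5", "b5", "c5", "1Dec1984", "31Dec2017")],
    columns=["A", "B", "C", "Date1", "Date2"],
)
start = pd.to_datetime(df["Date1"], format="%d%b%Y")
end = pd.to_datetime(df["Date2"], format="%d%b%Y")

# Express each date as a running month number so ranges become integer arithmetic.
first = (start.dt.year * 12 + start.dt.month - 1).to_numpy()
last = (end.dt.year * 12 + end.dt.month - 1).to_numpy()
counts = last - first + 1  # months covered by each row, endpoints included

# Repeat each input row once per covered month.
out = df.loc[df.index.repeat(counts), ["A", "B", "C"]].reset_index(drop=True)

# Position of each output row within its own group: 0, 1, 2, ...
group_starts = np.cumsum(counts) - counts
offsets = np.arange(counts.sum()) - np.repeat(group_starts, counts)
serial = np.repeat(first, counts) + offsets

out["Month"] = serial % 12 + 1
out["Year"] = serial // 12
print(out)
```

On the three sample rows this produces the same 406 (A, B, C, Month, Year) combinations as the loop, only with named columns and a default index.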

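Note that you may not really need to define the Month and Year columns

new_df['Month'] = new_df.index.month
new_df['Year'] = new_df.index.year

because you already have a PeriodIndex, which makes computing the month and year very easy.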
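
To illustrate that point, here is a minimal sketch of reading the month and year straight from the index when they are needed. The three-month frame is a hypothetical stand-in for the concatenated result, and `rows_1985` is an illustrative name.

```python
import pandas as pd

# Stand-in for the concatenated result: data columns plus a monthly PeriodIndex.
index = pd.period_range(start="1984-12", end="1985-02", freq="M")
result = pd.DataFrame({"A": "a5", "B": "b5", "C": "c5"}, index=index)

# Month and year can be read from the index on demand.
print(result.index.month)
print(result.index.year)

# For example, select the 1985 rows without storing a Year column.
rows_1985 = result[result.index.year == 1985]
print(rows_1985)
```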

Regarding a Pythonic way to fill rows with a date range, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53780270/
