
python - Apply a function to a groupby object to append a row to each group

Reposted. Author: 行者123. Updated: 2023-12-01 02:50:27

I have a fairly large dataset, but for reproducibility, suppose I have the following multi-indexed DataFrame:

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index=index)
a[1] = pd.date_range('2017-07-02', periods=10, freq='5min')

a
Out[68]:
                     0                   1
first second
bar   one     0.705488 2017-07-02 00:00:00
      one     0.715645 2017-07-02 00:05:00
      two     0.194648 2017-07-02 00:10:00
baz   one     0.129729 2017-07-02 00:15:00
      two     0.449889 2017-07-02 00:20:00
foo   one     0.031531 2017-07-02 00:25:00
      two     0.320757 2017-07-02 00:30:00
      two     0.876243 2017-07-02 00:35:00
qux   one     0.443682 2017-07-02 00:40:00
      two     0.802774 2017-07-02 00:45:00

I want to append the current timestamp as a new row to each group identified by a first-second index combination (e.g. bar-one, bar-two, etc.).

What I did:

A function that appends the timestamp to each group:

def myfunction(g, now):
    # Add a row labelled with the group's row count; only column 1 is set.
    g.loc[g.shape[0], 1] = now  # current timestamp
    return g

Applying that function to the groupby object:

# current timestamp (originally pd.datetime.now(), which newer pandas releases no longer provide)
now = pd.Timestamp.now()

a = a.reset_index().groupby(['first', 'second']).apply(lambda x: myfunction(x, now))

This returns:

               first second         0                       1
first second
bar   one    0   bar    one  0.705488 2017-07-02 00:00:00.000
             1   bar    one  0.715645 2017-07-02 00:05:00.000
             2   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    2   bar    two  0.194648 2017-07-02 00:10:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
baz   one    3   baz    one  0.129729 2017-07-02 00:15:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    4   baz    two  0.449889 2017-07-02 00:20:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
foo   one    5   foo    one  0.031531 2017-07-02 00:25:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    6   foo    two  0.320757 2017-07-02 00:30:00.000
             7   foo    two  0.876243 2017-07-02 00:35:00.000
             2   NaN    NaN       NaN 2017-07-02 02:05:06.442
qux   one    8   qux    one  0.443682 2017-07-02 00:40:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    9   qux    two  0.802774 2017-07-02 00:45:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442

I don't understand why a new index level was introduced, but I can get rid of it and end up with what I want:

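# Move the extra (third) index level into a column, then drop it together
# with the duplicated 'first' and 'second' columns.
a = a.reset_index(level=2).drop(['level_2', 'first', 'second'], axis=1).loc[:, [0, 1]]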

                     0                       1
first second
bar   one     0.705488 2017-07-02 00:00:00.000
      one     0.715645 2017-07-02 00:05:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.194648 2017-07-02 00:10:00.000
      two          NaN 2017-07-02 02:05:06.442
baz   one     0.129729 2017-07-02 00:15:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.449889 2017-07-02 00:20:00.000
      two          NaN 2017-07-02 02:05:06.442
foo   one     0.031531 2017-07-02 00:25:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.320757 2017-07-02 00:30:00.000
      two     0.876243 2017-07-02 00:35:00.000
      two          NaN 2017-07-02 02:05:06.442
qux   one     0.443682 2017-07-02 00:40:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.802774 2017-07-02 00:45:00.000
      two          NaN 2017-07-02 02:05:06.442
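
On the extra index level: as far as I can tell, it comes from groupby(...).apply itself, which by default (group_keys=True) prepends the group keys to the index of whatever frame the function returns, while the original integer row labels remain as the innermost level. The sketch below is only an illustration on a cut-down frame; the names flat, append_row, with_keys and without_keys are made up for this example, and a fixed timestamp stands in for the current time.

```python
import pandas as pd

# Cut-down stand-in for a.reset_index() from the question.
flat = pd.DataFrame({
    'first': ['bar', 'bar', 'baz'],
    'second': ['one', 'one', 'two'],
    0: [0.1, 0.2, 0.3],
    1: pd.date_range('2017-07-02', periods=3, freq='5min'),
})
now = pd.Timestamp('2017-07-02 02:05:06')  # fixed stand-in for "now"


def append_row(g, now):
    """Return a copy of the group with one extra row holding the timestamp."""
    g = g.copy()
    # Use a label past the largest existing one so no row is overwritten.
    g.loc[g.index.max() + 1, 1] = now
    return g


keys = ['first', 'second']

# Default behaviour: the group keys become extra outer index levels.
with_keys = flat.groupby(keys).apply(lambda g: append_row(g, now))

# group_keys=False keeps only the row labels returned by the function.
without_keys = flat.groupby(keys, group_keys=False).apply(
    lambda g: append_row(g, now)
)

print(with_keys.index.nlevels, without_keys.index.nlevels)
```

Note that with group_keys=False the appended rows carry no group identity unless the helper fills it in, so the default behaviour followed by dropping the innermost level may still be the more convenient route here.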

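Question:

I'd like to know whether there is a more elegant, simpler way to do this (append a new row to each group and, although not covered here, conditionally fill the remaining fields of that new row, i.e. everything except the timestamp field).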

Best answer

You can first group by the index, build the required extra row for each group, then concatenate it back and sort the df.

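import datetime as dt

now = dt.datetime.now()
# One placeholder row per (first, second) group, indexed by the group keys.
extra = a.groupby(level=[0, 1]).first()
extra[0] = np.nan  # value column left empty
extra[1] = now     # timestamp column set to the current time
pd.concat([a, extra]).sort_index()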

Out[538]:
                     0                          1
first second
bar   one     0.587648 2017-07-02 00:00:00.000000
      one     0.974524 2017-07-02 00:05:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.555171 2017-07-02 00:10:00.000000
      two          NaN 2017-07-02 15:18:57.503371
baz   one     0.832874 2017-07-02 00:15:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.956891 2017-07-02 00:20:00.000000
      two          NaN 2017-07-02 15:18:57.503371
foo   one     0.872959 2017-07-02 00:25:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.056546 2017-07-02 00:30:00.000000
      two     0.359184 2017-07-02 00:35:00.000000
      two          NaN 2017-07-02 15:18:57.503371
qux   one     0.301327 2017-07-02 00:40:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.891815 2017-07-02 00:45:00.000000
      two          NaN 2017-07-02 15:18:57.503371
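
The question also asks about conditionally filling the other fields of the appended row. One possible way to extend the concat approach is to compute the extra rows from per-group aggregates and mask them with a condition. This is a rough sketch, not part of the original answer: the rule used here (groups with more than one row get their mean in column 0, all others stay NaN) is purely hypothetical, the small frame and the fixed timestamp are stand-ins, and the names sizes, means, extra and result are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for the frame `a` from the question.
arrays = [['bar', 'bar', 'bar', 'baz'], ['one', 'one', 'two', 'one']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
a = pd.DataFrame({0: [0.2, 0.4, 0.6, 0.8]}, index=index)
a[1] = pd.date_range('2017-07-02', periods=4, freq='5min')

now = pd.Timestamp('2017-07-02 15:00:00')  # fixed stand-in for "now"

grouped = a.groupby(level=['first', 'second'])
sizes = grouped.size()     # number of rows per group
means = grouped[0].mean()  # per-group mean of column 0

# Hypothetical rule: keep the mean only where the group has more than
# one row; everywhere else the value becomes NaN.
extra = pd.DataFrame(
    {0: means.where(sizes > 1), 1: now},
    index=means.index,
)

# Append one extra row per group and regroup the rows by index.
result = pd.concat([a, extra]).sort_index()
print(result)
```

Because extra is an ordinary DataFrame indexed by the group keys, any other per-group rule can be swapped in before the concat without touching the rest.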

Regarding python - Apply a function to a groupby object to append a row to each group, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44865453/
