
Pandas for Python, grouping

Reposted. Author: 行者123. Updated: 2023-12-04 15:12:11

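I have a dataset in which each timestamp consists of multiple tuples, and each tuple has a count. Different tuples may be present at each timestamp. I would like to group them into 5-minute bins and add up the counts for each unique tuple. Is there a nice, clean way to do this with Pandas group-by?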

They have the following form:
((u'67.163.47.231', u'8.27.82.254', 50186, 80, 6, 1377565195000), 2)

This is currently a list in which each item is a 6-tuple (the last entry is the timestamp) followed by the count.

Each timestamp will have a collection of 5-tuples.

Each record is (5-tuple fields plus timestamp, count). Example, for a single timestamp only:

[((u'71.57.43.240', u'8.27.82.254', 33108, 80, 6, 1377565195000), 1),
((u'67.163.47.231', u'8.27.82.254', 50186, 80, 6, 1377565195000), 2),
((u'8.27.82.254', u'98.206.29.242', 25159, 80, 6, 1377565195000), 1),
((u'71.179.102.253', u'8.27.82.254', 50958, 80, 6, 1377565195000), 1)]

In [220]: df = DataFrame ( { 'key1' : [ (u'71.57.43.240', u'8.27.82.254', 33108, 80, 6), (u'67.163.47.231', u'8.27.82.254', 50186, 80, 6) ], 'data1' : np.array((1,2)), 'data2': np.array((1377565195000,1377565195000))})

In [226]: df
Out[226]:
data1 data2 key1
0 1 1377565195000 (71.57.43.240, 8.27.82.254, 33108, 80, 6)
1 2 1377565195000 (67.163.47.231, 8.27.82.254, 50186, 80, 6)

Or, with the timestamps converted to datetimes:
In [231]: df = DataFrame ( { 'key1' : [ (u'71.57.43.240', u'8.27.82.254', 33108, 80, 6), (u'67.163.47.231', u'8.27.82.254', 50186, 80, 6) ], 'data1' : np.array((1,2)), 
.....: 'data2': np.array(( datetime.utcfromtimestamp(1377565195),datetime.utcfromtimestamp(1377565195) )) })

In [232]: df
Out[232]:
data1 data2 key1
0 1 2013-08-27 00:59:55 (71.57.43.240, 8.27.82.254, 33108, 80, 6)
1 2 2013-08-27 00:59:55 (67.163.47.231, 8.27.82.254, 50186, 80, 6)
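
For reference, here is a minimal sketch of loading the raw list from the question into a frame of that shape in one go. It assumes the trailing timestamp is in milliseconds since the epoch (which matches the converted output above) and reuses the question's column names key1, data1 and data2. The variable names records and flows are made up for illustration.

```python
import pandas as pd

# Records in the question's form: ((src, dst, sport, dport, proto, ts_ms), count)
records = [
    (('71.57.43.240', '8.27.82.254', 33108, 80, 6, 1377565195000), 1),
    (('67.163.47.231', '8.27.82.254', 50186, 80, 6, 1377565195000), 2),
    (('8.27.82.254', '98.206.29.242', 25159, 80, 6, 1377565195000), 1),
    (('71.179.102.253', '8.27.82.254', 50958, 80, 6, 1377565195000), 1),
]

# Split each 6-tuple into the 5-tuple key and the trailing timestamp.
flows = pd.DataFrame({
    'key1': [rec[:5] for rec, _ in records],
    'data1': [count for _, count in records],
    # Assumption: the timestamps are epoch milliseconds.
    'data2': pd.to_datetime([rec[5] for rec, _ in records], unit='ms'),
})

print(flows)
```

Using pd.to_datetime with unit='ms' avoids dividing by 1000 and calling datetime.utcfromtimestamp on each value by hand.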


Here's a simpler example:

time count city
00:00:00 1 Montreal
00:00:00 2 New York
00:00:00 1 Chicago
00:01:00 2 Montreal
00:01:00 3 New York

After binning into 5-minute intervals:

time count city
00:05:00 3 Montreal
00:05:00 5 New York
00:05:00 1 Chicago
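
The binned table above labels each interval by its right edge (00:05:00). One way that might reproduce it is pd.Grouper with label='right'. This is only a sketch: the date 2013-09-07 is an arbitrary choice so the times parse deterministically, and the result comes back sorted by city rather than in the original row order.

```python
import pandas as pd

frame = pd.DataFrame({
    'time': pd.to_datetime(['2013-09-07 00:00:00'] * 3 + ['2013-09-07 00:01:00'] * 2),
    'count': [1, 2, 1, 2, 3],
    'city': ['Montreal', 'New York', 'Chicago', 'Montreal', 'New York'],
})

# Bin on the 'time' column in 5-minute steps, labelling each bin by its right edge,
# then sum the counts per (bin, city) pair.
binned = frame.groupby(
    [pd.Grouper(key='time', freq='5min', label='right'), 'city']
)['count'].sum()

print(binned)
```

Dropping label='right' would label the same bin 00:00:00 instead, which is the convention the rest of this page uses.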

Here is what seems to work well:
# Assumption: parse is dateutil.parser.parse, which fills in today's date.
from dateutil.parser import parse
from pandas import DataFrame

times = [ parse('00:00:00'), parse('00:00:00'), parse('00:00:00'), parse('00:01:00'), parse('00:01:00'),
          parse('00:02:00'), parse('00:02:00'), parse('00:03:00'), parse('00:04:00'), parse('00:05:00'),
          parse('00:05:00'), parse('00:06:00'), parse('00:06:00') ]
cities = [ 'Montreal', 'New York', 'Chicago', 'Montreal', 'New York',
           'New York', 'Chicago', 'Montreal', 'Montreal', 'New York', 'Chicago', 'Montreal', 'Chicago']
counts = [ 1, 2, 1, 2, 3, 1, 1, 1, 2, 2, 2, 1, 1]
frame = DataFrame( { 'city': cities, 'time': times, 'count': counts } )

In [150]: frame
Out[150]:
city count time
0 Montreal 1 2013-09-07 00:00:00
1 New York 2 2013-09-07 00:00:00
2 Chicago 1 2013-09-07 00:00:00
3 Montreal 2 2013-09-07 00:01:00
4 New York 3 2013-09-07 00:01:00
5 New York 1 2013-09-07 00:02:00
6 Chicago 1 2013-09-07 00:02:00
7 Montreal 1 2013-09-07 00:03:00
8 Montreal 2 2013-09-07 00:04:00
9 New York 2 2013-09-07 00:05:00
10 Chicago 2 2013-09-07 00:05:00
11 Montreal 1 2013-09-07 00:06:00
12 Chicago 1 2013-09-07 00:06:00

frame['time_5min'] = frame['time'].map(lambda x: pd.DataFrame([0],index=pd.DatetimeIndex([x])).resample('5min').index[0])

In [152]: frame
Out[152]:
city count time time_5min
0 Montreal 1 2013-09-07 00:00:00 2013-09-07 00:00:00
1 New York 2 2013-09-07 00:00:00 2013-09-07 00:00:00
2 Chicago 1 2013-09-07 00:00:00 2013-09-07 00:00:00
3 Montreal 2 2013-09-07 00:01:00 2013-09-07 00:00:00
4 New York 3 2013-09-07 00:01:00 2013-09-07 00:00:00
5 New York 1 2013-09-07 00:02:00 2013-09-07 00:00:00
6 Chicago 1 2013-09-07 00:02:00 2013-09-07 00:00:00
7 Montreal 1 2013-09-07 00:03:00 2013-09-07 00:00:00
8 Montreal 2 2013-09-07 00:04:00 2013-09-07 00:00:00
9 New York 2 2013-09-07 00:05:00 2013-09-07 00:05:00
10 Chicago 2 2013-09-07 00:05:00 2013-09-07 00:05:00
11 Montreal 1 2013-09-07 00:06:00 2013-09-07 00:05:00
12 Chicago 1 2013-09-07 00:06:00 2013-09-07 00:05:00
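
The per-row resample trick used to build time_5min relies on older pandas behaviour. In newer versions the same column can probably be produced more directly by flooring the datetimes with Series.dt.floor. A small sketch on a subset of the rows above, with the date hard-coded for reproducibility:

```python
import pandas as pd

frame = pd.DataFrame({
    'city': ['Montreal', 'Montreal', 'New York', 'Chicago'],
    'count': [1, 2, 2, 1],
    'time': pd.to_datetime([
        '2013-09-07 00:00:00', '2013-09-07 00:04:00',
        '2013-09-07 00:05:00', '2013-09-07 00:06:00',
    ]),
})

# Round every timestamp down to the start of its 5-minute interval.
frame['time_5min'] = frame['time'].dt.floor('5min')

print(frame)
```

This is vectorised, so it avoids building a one-row DataFrame for every timestamp.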

In [153]: df = frame.groupby(['time_5min', 'city']).aggregate('sum')

In [154]: df
Out[154]:
count
time_5min city
2013-09-07 00:00:00 Chicago 2
Montreal 6
New York 6
2013-09-07 00:05:00 Chicago 3
Montreal 1
New York 2

In [155]: df.reset_index(1)
Out[155]:
city count
time_5min
2013-09-07 00:00:00 Chicago 2
2013-09-07 00:00:00 Montreal 6
2013-09-07 00:00:00 New York 6
2013-09-07 00:05:00 Chicago 3
2013-09-07 00:05:00 Montreal 1
2013-09-07 00:05:00 New York 2

Best answer

If you set the date as the index, you can use TimeGrouper (which lets you group by, for example, 5-minute intervals):

In [11]: from pandas.tseries.resample import TimeGrouper

In [12]: df.set_index('data2', inplace=True)

In [13]: g = df.groupby(TimeGrouper('5Min'))
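
Note that TimeGrouper was deprecated and later removed from pandas, so the import above is likely to fail on a current install. pd.Grouper(freq=...) plays the same role. The sketch below applies it to the question's city data instead of the tuple data, purely to keep the example small. The date is hard-coded.

```python
import pandas as pd

minutes = [0, 0, 0, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6]
cities = ['Montreal', 'New York', 'Chicago', 'Montreal', 'New York',
          'New York', 'Chicago', 'Montreal', 'Montreal', 'New York',
          'Chicago', 'Montreal', 'Chicago']

frame = pd.DataFrame({
    'city': cities,
    'time': pd.Timestamp('2013-09-07') + pd.to_timedelta(minutes, unit='min'),
}).set_index('time')

# With a DatetimeIndex in place, Grouper(freq=...) bins the index into 5-minute intervals.
g = frame.groupby(pd.Grouper(freq='5min'))

# Number of distinct cities seen in each interval.
unique_cities = g['city'].nunique()
print(unique_cities)
```

If you would rather not move the timestamps into the index, pd.Grouper(key='time', freq='5min') bins on a column instead.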

You can then use nunique to count the number of unique items in each 5-minute interval:
In [14]: g['key1'].nunique()
Out[14]:
2013-08-27 00:55:00 2
dtype: int64

If you are looking for the count of each tuple, you can use value_counts:
In [15]: g['key1'].apply(pd.value_counts)
Out[15]:
2013-08-27 00:55:00 (71.57.43.240, 8.27.82.254, 33108, 80, 6) 1
(67.163.47.231, 8.27.82.254, 50186, 80, 6) 1
dtype: int64

Note: this is a Series with a MultiIndex (use reset_index to turn it into a DataFrame).
In [16]: g['key1'].apply(pd.value_counts).reset_index(1)
Out[16]:
level_1 0
2013-08-27 00:55:00 (71.57.43.240, 8.27.82.254, 33108, 80, 6) 1
2013-08-27 00:55:00 (67.163.47.231, 8.27.82.254, 50186, 80, 6) 1

You may want to give these more informative column names :).

Update: previously I hacked this together with get_dummies; see the edit history.
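
One caveat: value_counts counts how many rows each tuple appears in, whereas the question asks for the existing counts to be added up. A sketch of the summing variant with named columns follows. The names src, dst, sport, dport, proto and total_count are hypothetical labels, the timestamps are assumed to be epoch milliseconds, and the third record is made up (60 seconds earlier than the others) so that there is something to sum.

```python
import pandas as pd

records = [
    (('71.57.43.240', '8.27.82.254', 33108, 80, 6, 1377565195000), 1),
    (('67.163.47.231', '8.27.82.254', 50186, 80, 6, 1377565195000), 2),
    # Made-up extra sample for illustration: same flow, 60 seconds earlier.
    (('67.163.47.231', '8.27.82.254', 50186, 80, 6, 1377565135000), 3),
]

# Hypothetical column names for the five fields of each tuple.
flow_cols = ['src', 'dst', 'sport', 'dport', 'proto']

df = pd.DataFrame([rec[:5] for rec, _ in records], columns=flow_cols)
df['count'] = [count for _, count in records]
df['time'] = pd.to_datetime([rec[5] for rec, _ in records], unit='ms')

# Group by 5-minute interval and by flow, then add up the counts.
totals = (
    df.groupby([pd.Grouper(key='time', freq='5min')] + flow_cols)['count']
      .sum()
      .reset_index(name='total_count')
)

print(totals)
```

Spreading the 5-tuple over separate columns keeps the group keys as plain scalars and gives every output column a readable name.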

Regarding Pandas for Python, grouping, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18645127/
