
python - Grouping data on multiple columns (including an auto-generated column) in a DataFrame

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:47:06

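I am trying to create an index from two columns in a pandas DataFrame. However, I first want to "bucket" the values in one of the columns, and then use the bucketed values in the index.

The code below should help explain further: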

import numpy as np
import pandas as pd

# No error checking, pseudocode ...
def bucket_generator(source_data, colname, step_size):
    # create bucket column (string)
    source_data['bucket'] = ''

    # obtain the series to operate on
    series = source_data[colname]

    # determine which bucket number each cell in series would belong to,
    # by dividing the cell value by the step_size

    # Naive way would be to iterate over cells in series, generating a
    # bucket label like "bucket_{0:+}".format(cell_value/step_size),
    # then stick it in a cell in the bucket column, but there must be a more
    # 'dataframe' way of doing it, rather than looping




data = {'a': (10,3,5,7,15,20,10,3,5,7,19,5,7,5,10,5,3,7,20,20),
        'b': (98.5,107.2,350,211.2,120.5,-70.8,135.9,205.1,-12.8,280.5,-19.7,77.2,88.2,69.2,101.2,-302.4,-79.8,-257.6,89.6,95.7),
        'c': (12.5,23.4,11.5,45.2,17.6,19.5,0.25,33.6,18.9,6.5,12.5,26.2,5.2,0.3,7.2,8.9,2.1,3.1,19.1,20.2)
        }

df = pd.DataFrame(data)

df

     a      b      c
0   10   98.5  12.50
1    3  107.2  23.40
2    5  350.0  11.50
3    7  211.2  45.20
4   15  120.5  17.60
5   20  -70.8  19.50
6   10  135.9   0.25
7    3  205.1  33.60
8    5  -12.8  18.90
9    7  280.5   6.50
10  19  -19.7  12.50
11   5   77.2  26.20
12   7   88.2   5.20
13   5   69.2   0.30
14  10  101.2   7.20
15   5 -302.4   8.90
16   3  -79.8   2.10
17   7 -257.6   3.10
18  20   89.6  19.10
19  20   95.7  20.20

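This is what I want to do:

  1. Implement the function bucket_generator correctly.
  2. Group the DataFrame by column "a" and then by the "bucket" label.
  3. Select from the DataFrame the rows that have a given (integer) value in column "a" and a given bucket "label" in the bucket column.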

Best answer

New answer

Focusing on what the OP asked for

def bucket_generator(source_data, colname, step_size):
    series = source_data[colname]
    # floor-divide by step_size to get the bucket number, then build the label
    source_data['bucket'] = 'bucket_' + (series // step_size).astype(int).astype(str)

data = {'a': (10,3,5,7,15,20,10,3,5,7,19,5,7,5,10,5,3,7,20,20),
'b': (98.5,107.2,350,211.2,120.5,-70.8,135.9,205.1,-12.8,280.5,-19.7,77.2,88.2,69.2,101.2,-302.4,-79.8,-257.6,89.6,95.7),
'c': (12.5,23.4,11.5,45.2,17.6,19.5,0.25,33.6,18.9,6.5,12.5,26.2,5.2,0.3,7.2,8.9,2.1,3.1,19.1,20.2)
}

df = pd.DataFrame(data)
bucket_generator(df, 'a', 5)

df1 = df.set_index(['a', 'bucket']).sort_index(kind='mergesort')
print(df1.xs((3, 'bucket_0')).reset_index())

dob = {bucket: group for bucket, group in df.groupby(['a', 'bucket'])}
print(dob[(3, 'bucket_0')])

   a    bucket      b     c
0  3  bucket_0  107.2  23.4
1  3  bucket_0  205.1  33.6
2  3  bucket_0  -79.8   2.1

    a      b     c    bucket
1   3  107.2  23.4  bucket_0
7   3  205.1  33.6  bucket_0
16  3  -79.8   2.1  bucket_0
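
The function above buckets a column and modifies the frame in place. As a variation, here is a minimal sketch that returns a new frame and writes the label with an explicit sign, in the spirit of the "bucket_{0:+}" format mentioned in the question. The helper name add_bucket, the choice of column 'b', and the step size of 100 are assumptions made for illustration only; they are not part of the original answer.

```python
import pandas as pd

def add_bucket(source_data, colname, step_size):
    """Return a copy of source_data with a signed string 'bucket' column.

    Hypothetical helper: the name and the signed label format are assumptions.
    """
    # Floor division gives the bucket number; negative values fall into
    # negative buckets (for example -70.8 // 100 is -1.0).
    numbers = (source_data[colname] // step_size).astype(int)
    # Format each bucket number with an explicit sign.
    labels = numbers.map(lambda n: 'bucket_{:+d}'.format(int(n)))
    # assign() returns a new frame, so the input is left untouched.
    return source_data.assign(bucket=labels)

df = pd.DataFrame({
    'a': (10, 3, 5, 7, 15, 20),
    'b': (98.5, 107.2, 350, 211.2, 120.5, -70.8),
})
bucketed = add_bucket(df, 'b', 100)

# Group by 'a' and then by the bucket label, and pull out a single group.
groups = bucketed.groupby(['a', 'bucket'])
print(groups.get_group((20, 'bucket_-1')))
```

Returning a copy keeps the original frame reusable with a different step size, at the cost of some extra memory.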

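Old answer

  • Assign a list of the arrays you want as index levels to df.index.
  • Use pd.qcut to help with the bucketing.
  • Use a list comprehension to help with the labeling.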
def enlabeler(s, n):
    return ['{}_{}'.format(s, i) for i in range(n)]

df.index = [
    pd.qcut(df.a, 3, enlabeler('a', 3)),
    pd.qcut(df.b, 3, enlabeler('b', 3)),
    pd.qcut(df.c, 3, enlabeler('c', 3))
]

print(df)
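
To make the resulting index easier to picture, here is a small sketch on a six-row frame. The frame small and its values are made up for illustration, and the index is built with pd.MultiIndex.from_arrays explicitly, which is my own choice here rather than the direct list assignment used above.

```python
import pandas as pd

def enlabeler(s, n):
    return ['{}_{}'.format(s, i) for i in range(n)]

# Hypothetical data, chosen so that each quantile bin holds two rows.
small = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                      'b': [60, 50, 40, 30, 20, 10]})

# One qcut per column; each produces a categorical Series of labels.
small.index = pd.MultiIndex.from_arrays([
    pd.qcut(small.a, 3, labels=enlabeler('a', 3)),
    pd.qcut(small.b, 3, labels=enlabeler('b', 3)),
])
print(small)

# Select the rows whose 'a' value falls in the lowest third.
lowest = small[small.index.get_level_values(0) == 'a_0']
print(lowest)
```

The boolean mask on get_level_values works regardless of whether the index is sorted, which is why it is used here instead of label-based slicing.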


A bit more dynamic, and with a subset of the columns

def enlabeler(s, n):
    return ['{}_{}'.format(s, i) for i in range(n)]

def cutcol(c, n):
    return pd.qcut(c, n, enlabeler(c.name, n))

df.index = df[['a', 'b']].apply(cutcol, n=3).values.T.tolist()
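
One thing worth keeping in mind: pd.qcut makes equal-frequency bins, whereas the question talks about a fixed step_size. The sketch below contrasts the two on a made-up series, using pd.cut with explicit edges for the fixed-width case. The values, the width of 50, and the label names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical, deliberately skewed values.
values = pd.Series([1, 2, 3, 4, 50, 100], name='b')
labels = ['b_0', 'b_1', 'b_2']

# Equal-frequency bins: each label covers the same number of rows.
by_quantile = pd.qcut(values, 3, labels=labels)

# Fixed-width bins [0, 50), [50, 100), [100, 150), closer to a step size.
edges = np.arange(0, 151, 50)
by_width = pd.cut(values, bins=edges, right=False, labels=labels)

print(pd.DataFrame({'b': values, 'qcut': by_quantile, 'cut': by_width}))
```

With skewed data the two labelings differ noticeably, so the choice depends on whether the buckets should be balanced in size or uniform in width.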


Regarding "python - Grouping data on multiple columns (including an auto-generated column) in a DataFrame", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41656105/
