python - pandas - Binning with bin definitions that depend on the value in another column

Reposted by: 太空宇宙. Updated: 2023-11-04 01:25:21


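I am struggling with the following task: I need to discretize the values in one column of a data frame, with the bins defined by the value in another column.

For a minimal working example, let's define a simple data frame: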

import numpy as np
import pandas as pd
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,'B' : np.random.randn(12)})

The data frame looks like this:

        A       B
0       one     2.5772143847077427
1       one     -0.6394141654096013
2       two     0.964652049995486
3       three   -0.3922889559403503
4       one     1.6903991754896424
5       one     0.5741442025742018
6       two     0.6300564981683544
7       three   0.9403680915507433
8       one     0.7044433078166983
9       one     -0.1695006646595688
10      two     0.06376190217285167
11      three   0.277540580579127

Now I would like to add a column C that contains a bin label, with a different set of bins for each value in column A, i.e.:

(-10,-1,0,1,10) for A == 'one',
(-100,0,100) for A == 'two',
(-999,0,1,2,3) for A == 'three'.

The desired output is:

        A       B       C
0       one     2.5772143847077427      (1, 10]
1       one     -0.6394141654096013     (-1, 0]
2       two     0.964652049995486       (0, 100]
3       three   -0.3922889559403503     (-999, 0]
4       one     1.6903991754896424      (1, 10]
5       one     0.5741442025742018      (0, 1]
6       two     0.6300564981683544      (0, 100]
7       three   0.9403680915507433      (0, 1]
8       one     0.7044433078166983      (0, 1]
9       one     -0.1695006646595688     (-1, 0]
10      two     0.06376190217285167     (0, 100]
11      three   0.277540580579127       (0, 1]

I have tried various combinations of pd.cut or np.digitize with map and apply, but without success.
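For reference, here is a minimal sketch of how np.digitize relates to the interval labels above for a single bin definition. It uses the edges for A == 'one' and three B values taken from the example frame; the names edges, values, idx and labels are illustrative only, and the label formatting assumes every value falls inside the outermost edges.

```python
import numpy as np

# Bin edges for A == 'one' and three sample B values from the example frame.
edges = np.array([-10, -1, 0, 1, 10])
values = np.array([2.5772143847077427, -0.6394141654096013, 0.5741442025742018])

# With right=True, index i means edges[i-1] < x <= edges[i],
# which matches the right-closed intervals that pd.cut produces by default.
idx = np.digitize(values, edges, right=True)

# Turn each index back into a readable label (valid only for in-range values).
labels = [f"({edges[i - 1]}, {edges[i]}]" for i in idx]
print(idx.tolist())
print(labels)
```

np.digitize only returns integer positions, so the mapping back to labels, and the choice of edges per value of A, still has to be handled separately.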

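At the moment I get the result by splitting the frame, applying pd.cut to each subset separately, and then concatenating the pieces back into one frame, like this: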

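values_in_column_A = df['A'].unique().tolist()
bins = {'one':(-10,-1,0,1,10),'two':(-100,0,100),'three':(-999,0,1,2,3)}

def binnize(df):
    # Split the frame by each value of A, bin B with that value's edges, then recombine.
    subdf = []
    for value in values_in_column_A:
        # Work on a copy so the assignment below does not write to a view of df.
        part = df[df['A'] == value].copy()
        part['C'] = pd.cut(part['B'], bins[value])
        subdf.append(part)

    # Note: rows come back grouped by A, not in the original order.
    return pd.concat(subdf)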

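This works, but I do not find it elegant, and I expect speed or memory problems in production, where the frames will have millions of rows. Frankly, I think this can be done better.

I would appreciate any help or ideas...

Best answer

Does this solve your problem?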

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : np.random.randn(12)})
bins = {'one': (-10,-1,0,1,10), 'two':(-100,0,100), 'three':(-999,0,1,2,3)}

def func(row):
    # Look up the bin edges for this row's A value and cut the single B value.
    return pd.cut([row['B']], bins=bins[row['A']])[0]

# axis=1 calls func once per row.
df['C'] = df.apply(func, axis=1)

This returns a DataFrame:

        A         B          C
0     one  1.440957    (1, 10]
1     one  0.394580     (0, 1]
2     two -0.039619  (-100, 0]
3   three -0.500325  (-999, 0]
4     one  0.497256     (0, 1]
5     one  0.342222     (0, 1]
6     two -0.968390  (-100, 0]
7   three -0.772321  (-999, 0]
8     one  0.803178     (0, 1]
9     one  0.201513     (0, 1]
10    two  1.178546   (0, 100]
11  three -0.149662  (-999, 0]
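Because the frame above is built from random numbers, its values differ from those in the question. As a sketch, the same row-wise function can be run on the B values listed in the question to check that it reproduces the desired labels. The bin edges are written as lists here purely as a precaution; nothing else is changed.

```python
import pandas as pd

# Bin edges per value of A, as given in the question (lists instead of tuples).
bins = {'one': [-10, -1, 0, 1, 10], 'two': [-100, 0, 100], 'three': [-999, 0, 1, 2, 3]}

# The exact B values shown in the question's example frame.
df = pd.DataFrame({
    'A': ['one', 'one', 'two', 'three'] * 3,
    'B': [2.5772143847077427, -0.6394141654096013, 0.964652049995486,
          -0.3922889559403503, 1.6903991754896424, 0.5741442025742018,
          0.6300564981683544, 0.9403680915507433, 0.7044433078166983,
          -0.1695006646595688, 0.06376190217285167, 0.277540580579127],
})

def func(row):
    # Cut the single B value with the edges that belong to this row's A value.
    return pd.cut([row['B']], bins=bins[row['A']])[0]

df['C'] = df.apply(func, axis=1)
print(df)
```

Each entry in C is a pandas Interval, so a membership check such as `df.loc[0, 'B'] in df.loc[0, 'C']` can be used to confirm a label.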

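A faster version of binnize:

def binize2(df):
    # Create the output column first, then fill it group by group (modifies df in place).
    df['C'] = ''
    for key, values in bins.items():
        # Boolean mask selecting the rows whose A value uses this set of edges.
        mask = df['A'] == key
        df.loc[mask, 'C'] = pd.cut(df.loc[mask, 'B'], bins=values)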

%%timeit
df3 = binnize(df1)
10 loops, best of 3: 56.2 ms per loop

%%timeit
binize2(df2)
100 loops, best of 3: 6.64 ms per loop

This is probably because it modifies the DataFrame in place instead of creating a new one.
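If modifying the input frame is not wanted, the same per-group idea can be written so that it returns a new frame. The following is only a sketch: the function name bin_by_group and its parameters are hypothetical, it assumes the frame has a unique index and that every value of A has an entry in bins, and it has not been timed against the versions above.

```python
import numpy as np
import pandas as pd

def bin_by_group(df, bins, key='A', value='B', out='C'):
    """Return a copy of df with column `out` holding the per-group bin labels."""
    pieces = []
    for name, group in df.groupby(key, sort=False):
        # Cut this group's values with the edges defined for its key.
        labels = pd.cut(group[value], bins=list(bins[name]))
        # Each group has different categories, so store plain Interval objects.
        pieces.append(labels.astype(object))
    # assign() aligns on the index, so the original row order is preserved.
    return df.assign(**{out: pd.concat(pieces)})

bins = {'one': (-10, -1, 0, 1, 10), 'two': (-100, 0, 100), 'three': (-999, 0, 1, 2, 3)}
rng = np.random.default_rng(0)  # seeded so the example is repeatable
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': rng.standard_normal(12)})

result = bin_by_group(df, bins)
print(result)
```

The input df is left without a C column; values that fall outside a group's outermost edges would come back as NaN, as with pd.cut in general.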

Regarding python - pandas - binning with bin definitions that depend on the value in another column, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18209851/
