
binary-data - How does this binary encoder work?


I am trying to understand the logic behind this binary encoder.

It automatically takes categorical variables and dummy-codes them (similar to one-hot encoding in sklearn), but reduces the number of output columns to the log2 of the number of unique values.

Basically, when I use this library I noticed that my dummy variables are limited to only a few columns. Upon further investigation I noticed this @staticmethod, which takes the log2 of the len of unique values in a categorical variable.

My question is WHY? I realize that this reduces the dimensionality of the output data, but what is the logic behind doing so? How does taking the log2 determine how many digits are needed to represent the data?

    def calc_required_digits(X, col):
        """
        figure out how many digits we need to represent the classes present
        """
        return int(np.ceil(np.log2(len(X[col].unique()))))
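
As a quick sanity check of what that static method computes, here is a small standalone sketch (the DataFrame and its 'color' column are hypothetical, not part of the original code):

import numpy as np
import pandas as pd

# toy frame with 4 distinct categories in a hypothetical 'color' column
X = pd.DataFrame({'color': ['red', 'green', 'blue', 'blue', 'yellow', 'red']})

n_unique = len(X['color'].unique())       # 4 distinct classes
digits = int(np.ceil(np.log2(n_unique)))  # ceil(log2(4)) = 2
print(n_unique, digits)                   # 4 2 -> two 0/1 columns are enough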

Full source code:

"""Binary encoding"""

import copy
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from category_encoders.ordinal import OrdinalEncoder
from category_encoders.utils import get_obj_cols, convert_input

__author__ = 'willmcginnis'


class BinaryEncoder(BaseEstimator, TransformerMixin):
    """Binary encoding for categorical variables, similar to onehot, but stores categories as binary bitstrings.

    Parameters
    ----------

    verbose: int
        integer indicating verbosity of output. 0 for none.
    cols: list
        a list of columns to encode, if None, all string columns will be encoded
    drop_invariant: bool
        boolean for whether or not to drop columns with 0 variance
    return_df: bool
        boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array)
    impute_missing: bool
        boolean for whether or not to apply the logic for handle_unknown, will be deprecated in the future.
    handle_unknown: str
        options are 'error', 'ignore' and 'impute', defaults to 'impute', which will impute the category -1. Warning: if
        impute is used, an extra column will be added in if the transform matrix has unknown categories. This can cause
        unexpected changes in dimension in some cases.

    Example
    -------
    >>> from category_encoders import *
    >>> import pandas as pd
    >>> from sklearn.datasets import load_boston
    >>> bunch = load_boston()
    >>> y = bunch.target
    >>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    >>> enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X, y)
    >>> numeric_dataset = enc.transform(X)
    >>> print(numeric_dataset.info())
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 506 entries, 0 to 505
    Data columns (total 16 columns):
    CHAS_0     506 non-null int64
    RAD_0      506 non-null int64
    RAD_1      506 non-null int64
    RAD_2      506 non-null int64
    RAD_3      506 non-null int64
    CRIM       506 non-null float64
    ZN         506 non-null float64
    INDUS      506 non-null float64
    NOX        506 non-null float64
    RM         506 non-null float64
    AGE        506 non-null float64
    DIS        506 non-null float64
    TAX        506 non-null float64
    PTRATIO    506 non-null float64
    B          506 non-null float64
    LSTAT      506 non-null float64
    dtypes: float64(11), int64(5)
    memory usage: 63.3 KB
    None

    """
    def __init__(self, verbose=0, cols=None, drop_invariant=False, return_df=True, impute_missing=True, handle_unknown='impute'):
        self.return_df = return_df
        self.drop_invariant = drop_invariant
        self.drop_cols = []
        self.verbose = verbose
        self.impute_missing = impute_missing
        self.handle_unknown = handle_unknown
        self.cols = cols
        self.ordinal_encoder = None
        self._dim = None
        self.digits_per_col = {}

    def fit(self, X, y=None, **kwargs):
        """Fit encoder according to X and y.

        Parameters
        ----------

        X : array-like, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------

        self : encoder
            Returns self.

        """

        # if the input dataset isn't already a dataframe, convert it to one (using default column names)
        # first check the type
        X = convert_input(X)

        self._dim = X.shape[1]

        # if columns aren't passed, just use every string column
        if self.cols is None:
            self.cols = get_obj_cols(X)

        # train an ordinal pre-encoder
        self.ordinal_encoder = OrdinalEncoder(
            verbose=self.verbose,
            cols=self.cols,
            impute_missing=self.impute_missing,
            handle_unknown=self.handle_unknown
        )
        self.ordinal_encoder = self.ordinal_encoder.fit(X)

        for col in self.cols:
            self.digits_per_col[col] = self.calc_required_digits(X, col)

        # drop all output columns with 0 variance.
        if self.drop_invariant:
            self.drop_cols = []
            X_temp = self.transform(X)
            self.drop_cols = [x for x in X_temp.columns.values if X_temp[x].var() <= 10e-5]

        return self


    def transform(self, X):
        """Perform the transformation to new categorical data.

        Parameters
        ----------

        X : array-like, shape = [n_samples, n_features]

        Returns
        -------

        p : array, shape = [n_samples, n_numeric + N]
            Transformed values with encoding applied.

        """

        if self._dim is None:
            raise ValueError('Must train encoder before it can be used to transform data.')

        # first check the type
        X = convert_input(X)

        # then make sure that it is the right size
        if X.shape[1] != self._dim:
            raise ValueError('Unexpected input dimension %d, expected %d' % (X.shape[1], self._dim, ))

        if not self.cols:
            return X

        X = self.ordinal_encoder.transform(X)

        X = self.binary(X, cols=self.cols)

        if self.drop_invariant:
            for col in self.drop_cols:
                X.drop(col, 1, inplace=True)

        if self.return_df:
            return X
        else:
            return X.values


    def binary(self, X_in, cols=None):
        """
        Binary encoding encodes the integers as binary code with one column per digit.
        """

        X = X_in.copy(deep=True)

        if cols is None:
            cols = X.columns.values
            pass_thru = []
        else:
            pass_thru = [col for col in X.columns.values if col not in cols]

        bin_cols = []
        for col in cols:
            # get how many digits we need to represent the classes present
            digits = self.digits_per_col[col]

            # map the ordinal column into a list of these digits, of length digits
            X[col] = X[col].map(lambda x: self.col_transform(x, digits))

            for dig in range(digits):
                X[str(col) + '_%d' % (dig, )] = X[col].map(lambda r: int(r[dig]) if r is not None else None)
                bin_cols.append(str(col) + '_%d' % (dig, ))

        X = X.reindex(columns=bin_cols + pass_thru)

        return X


    @staticmethod
    def calc_required_digits(X, col):
        """
        figure out how many digits we need to represent the classes present
        """
        return int(np.ceil(np.log2(len(X[col].unique()))))




    @staticmethod
    def col_transform(col, digits):
        """
        The lambda body to transform the column values
        """

        if col is None or float(col) < 0.0:
            return None
        else:
            col = list("{0:b}".format(int(col)))
            if len(col) == digits:
                return col
            else:
                return [0 for _ in range(digits - len(col))] + col
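
To see how the two static helpers above fit together, here is a condensed standalone sketch of their logic applied to an ordinal-encoded toy column with 5 classes (the helpers are re-implemented inline and slightly simplified, so this illustrates the listed code rather than calling the library API):

import numpy as np

def calc_required_digits(n_unique):
    # ceil(log2(n)) binary digits are enough to tell n classes apart
    return int(np.ceil(np.log2(n_unique)))

def col_transform(val, digits):
    # left-pad the binary representation of val to a fixed number of digits
    bits = list('{0:b}'.format(int(val)))
    return ['0'] * (digits - len(bits)) + bits

digits = calc_required_digits(5)          # 3, since 2**3 = 8 >= 5
for v in range(5):
    print(v, ''.join(col_transform(v, digits)))
# 0 000, 1 001, 2 010, 3 011, 4 100 -> each class gets a distinct 3-bit code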

Best answer

My question is WHY? I realize that this reduces the dimensionality of the output data, but what is the logic behind doing this?

Basically, the point of categorical encoding is to let your algorithm handle categorical features. There are several ways to do this, binary encoding being one of them. Its logic is actually close to that of One Hot Encoding (OHE), if you understood that one.

With binary encoding, each unique label in your categorical vector is randomly associated with a number between 0 and (number of unique labels - 1). You then encode this number in base 2 and "transcribe" it as 0s and 1s spread over the newly created columns. For example, suppose your dataset has three distinct labels: 'A', 'B' and 'C'.
The following correspondence is established randomly:

'A' -> 1 -> 01;

'B' -> 2 -> 10;

'C' -> 0 -> 00.

Therefore, an encoded version of the example dataset is:

index    my_category    enc_category_0    enc_category_1
0        A              1                 0
1        B              0                 1
2        C              0                 0
3        A              1                 0

As for its usefulness, as you said, it reduces the dimensionality. Moreover, I suppose it helps to avoid ending up with too many zeros in the encoded columns, as happens with OHE. Here is an interesting post on the subject: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
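
A quick way to reproduce the 'A'/'B'/'C' example is to run the encoder on a tiny frame. The exact output column names and which number each label gets assigned can vary with the library version, so treat this as an illustrative sketch rather than guaranteed output:

import pandas as pd
from category_encoders import BinaryEncoder

# hypothetical toy column; 3 distinct labels -> roughly ceil(log2(3)) = 2 bit columns
df = pd.DataFrame({'my_category': ['A', 'B', 'C', 'A']})
enc = BinaryEncoder(cols=['my_category'])
print(enc.fit_transform(df))
# expect bit columns such as my_category_0 / my_category_1, with the two 'A'
# rows getting identical bit patterns, as in the table above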

How does taking the log2 determine how many digits are needed to represent the data? If you understood the working principle above, you understand the use of the log2: computing the log2 of a number (and rounding up) gives the number of binary digits needed to encode that number. Example: ceil(log2(10)) = ceil(3.32) = 4, so 4 digits are needed to encode 10 in binary.
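
A quick numeric check of that relationship (not part of the original answer): ceil(log2(n)) is the smallest number of digits d such that 2**d >= n.

import math

for n in (2, 3, 5, 10, 17):
    d = math.ceil(math.log2(n))
    # prints n, d, and checks that d digits suffice while d - 1 digits do not
    print(n, d, 2 ** d >= n, 2 ** (d - 1) >= n)
# e.g. n=10 -> d=4: 2**4 = 16 >= 10, while 2**3 = 8 < 10, so 4 digits are needed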

More information on the implementation and a code example: http://contrib.scikit-learn.org/categorical-encoding/_modules/category_encoders/binary.html#BinaryEncoder

Hope this is clear,

Regarding "binary-data - How does this binary encoder work?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47063221/
