
binary-data - How does this binary encoder work?


I am trying to understand the logic behind this binary encoder.

It automatically takes categorical variables and dummy-codes them (similar to one-hot encoding in sklearn), but reduces the number of output columns to the log2 of the number of unique values.

Basically, when I use this library I noticed that my dummy variables are limited to only a few columns. Upon further investigation I noticed this @staticmethod, which takes the log2 of the len of unique values in a categorical variable.

My question is WHY? I realize that this reduces the dimensionality of the output data, but what is the logic behind doing so? How does taking the log2 determine how many digits are needed to represent the data?

    def calc_required_digits(X, col):
        """
        figure out how many digits we need to represent the classes present
        """
        return int(np.ceil(np.log2(len(X[col].unique()))))
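
As a quick sanity check of what that static method computes, here is a small standalone sketch (the DataFrame and its 'color' column are hypothetical, not part of the original code):

import numpy as np
import pandas as pd

# toy frame with 4 distinct categories in a hypothetical 'color' column
X = pd.DataFrame({'color': ['red', 'green', 'blue', 'blue', 'yellow', 'red']})

n_unique = len(X['color'].unique())       # 4 distinct classes
digits = int(np.ceil(np.log2(n_unique)))  # ceil(log2(4)) = 2
print(n_unique, digits)                   # 4 2 -> two 0/1 columns are enough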

Full source code:

"""Binary encoding"""

import copy
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from category_encoders.ordinal import OrdinalEncoder
from category_encoders.utils import get_obj_cols, convert_input

__author__ = 'willmcginnis'


class BinaryEncoder(BaseEstimator, TransformerMixin):
    """Binary encoding for categorical variables, similar to onehot, but stores categories as binary bitstrings.

    Parameters
    ----------

    verbose: int
        integer indicating verbosity of output. 0 for none.
    cols: list
        a list of columns to encode, if None, all string columns will be encoded
    drop_invariant: bool
        boolean for whether or not to drop columns with 0 variance
    return_df: bool
        boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array)
    impute_missing: bool
        boolean for whether or not to apply the logic for handle_unknown, will be deprecated in the future.
    handle_unknown: str
        options are 'error', 'ignore' and 'impute', defaults to 'impute', which will impute the category -1. Warning: if
        impute is used, an extra column will be added in if the transform matrix has unknown categories. This can cause
        unexpected changes in dimension in some cases.

    Example
    -------
    >>> from category_encoders import *
    >>> import pandas as pd
    >>> from sklearn.datasets import load_boston
    >>> bunch = load_boston()
    >>> y = bunch.target
    >>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    >>> enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X, y)
    >>> numeric_dataset = enc.transform(X)
    >>> print(numeric_dataset.info())
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 506 entries, 0 to 505
    Data columns (total 16 columns):
    CHAS_0     506 non-null int64
    RAD_0      506 non-null int64
    RAD_1      506 non-null int64
    RAD_2      506 non-null int64
    RAD_3      506 non-null int64
    CRIM       506 non-null float64
    ZN         506 non-null float64
    INDUS      506 non-null float64
    NOX        506 non-null float64
    RM         506 non-null float64
    AGE        506 non-null float64
    DIS        506 non-null float64
    TAX        506 non-null float64
    PTRATIO    506 non-null float64
    B          506 non-null float64
    LSTAT      506 non-null float64
    dtypes: float64(11), int64(5)
    memory usage: 63.3 KB
    None

    """
    def __init__(self, verbose=0, cols=None, drop_invariant=False, return_df=True, impute_missing=True, handle_unknown='impute'):
        self.return_df = return_df
        self.drop_invariant = drop_invariant
        self.drop_cols = []
        self.verbose = verbose
        self.impute_missing = impute_missing
        self.handle_unknown = handle_unknown
        self.cols = cols
        self.ordinal_encoder = None
        self._dim = None
        self.digits_per_col = {}

    def fit(self, X, y=None, **kwargs):
        """Fit encoder according to X and y.

        Parameters
        ----------

        X : array-like, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------

        self : encoder
            Returns self.

        """

        # if the input dataset isn't already a dataframe, convert it to one (using default column names)
        # first check the type
        X = convert_input(X)

        self._dim = X.shape[1]

        # if columns aren't passed, just use every string column
        if self.cols is None:
            self.cols = get_obj_cols(X)

        # train an ordinal pre-encoder
        self.ordinal_encoder = OrdinalEncoder(
            verbose=self.verbose,
            cols=self.cols,
            impute_missing=self.impute_missing,
            handle_unknown=self.handle_unknown
        )
        self.ordinal_encoder = self.ordinal_encoder.fit(X)

        for col in self.cols:
            self.digits_per_col[col] = self.calc_required_digits(X, col)

        # drop all output columns with 0 variance.
        if self.drop_invariant:
            self.drop_cols = []
            X_temp = self.transform(X)
            self.drop_cols = [x for x in X_temp.columns.values if X_temp[x].var() <= 10e-5]

        return self


    def transform(self, X):
        """Perform the transformation to new categorical data.

        Parameters
        ----------

        X : array-like, shape = [n_samples, n_features]

        Returns
        -------

        p : array, shape = [n_samples, n_numeric + N]
            Transformed values with encoding applied.

        """

        if self._dim is None:
            raise ValueError('Must train encoder before it can be used to transform data.')

        # first check the type
        X = convert_input(X)

        # then make sure that it is the right size
        if X.shape[1] != self._dim:
            raise ValueError('Unexpected input dimension %d, expected %d' % (X.shape[1], self._dim, ))

        if not self.cols:
            return X

        X = self.ordinal_encoder.transform(X)

        X = self.binary(X, cols=self.cols)

        if self.drop_invariant:
            for col in self.drop_cols:
                X.drop(col, 1, inplace=True)

        if self.return_df:
            return X
        else:
            return X.values


    def binary(self, X_in, cols=None):
        """
        Binary encoding encodes the integers as binary code with one column per digit.
        """

        X = X_in.copy(deep=True)

        if cols is None:
            cols = X.columns.values
            pass_thru = []
        else:
            pass_thru = [col for col in X.columns.values if col not in cols]

        bin_cols = []
        for col in cols:
            # get how many digits we need to represent the classes present
            digits = self.digits_per_col[col]

            # map the ordinal column into a list of these digits, of length digits
            X[col] = X[col].map(lambda x: self.col_transform(x, digits))

            for dig in range(digits):
                X[str(col) + '_%d' % (dig, )] = X[col].map(lambda r: int(r[dig]) if r is not None else None)
                bin_cols.append(str(col) + '_%d' % (dig, ))

        X = X.reindex(columns=bin_cols + pass_thru)

        return X


    @staticmethod
    def calc_required_digits(X, col):
        """
        figure out how many digits we need to represent the classes present
        """
        return int(np.ceil(np.log2(len(X[col].unique()))))




    @staticmethod
    def col_transform(col, digits):
        """
        The lambda body to transform the column values
        """

        if col is None or float(col) < 0.0:
            return None
        else:
            col = list("{0:b}".format(int(col)))
            if len(col) == digits:
                return col
            else:
                return [0 for _ in range(digits - len(col))] + col
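
To see how the two static helpers above fit together, here is a condensed standalone sketch of their logic applied to an ordinal-encoded toy column with 5 classes (the helpers are re-implemented inline and slightly simplified, so this illustrates the listed code rather than calling the library API):

import numpy as np

def calc_required_digits(n_unique):
    # ceil(log2(n)) binary digits are enough to tell n classes apart
    return int(np.ceil(np.log2(n_unique)))

def col_transform(val, digits):
    # left-pad the binary representation of val to a fixed number of digits
    bits = list('{0:b}'.format(int(val)))
    return ['0'] * (digits - len(bits)) + bits

digits = calc_required_digits(5)          # 3, since 2**3 = 8 >= 5
for v in range(5):
    print(v, ''.join(col_transform(v, digits)))
# 0 000, 1 001, 2 010, 3 011, 4 100 -> each class gets a distinct 3-bit code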

Best answer

My question is WHY? I realize that this reduces the dimensionality of the output data, but what is the logic behind doing this?

Basically, the point of categorical encoding is to let your algorithm handle categorical features. There are several ways to do this, binary encoding being one of them. Its logic is actually close to that of One Hot Encoding (OHE), if you understood that one.

With binary encoding, each unique label in your categorical vector is randomly associated with a number between 0 and (number of unique labels - 1). You then encode this number in base 2 and "transcribe" it as 0s and 1s spread over the newly created columns. For example, suppose your dataset has three distinct labels: 'A', 'B' and 'C'.
The following correspondence is established randomly:

'A' -> 1 -> 01;

'B' -> 2 -> 10;

'C' -> 0 -> 00.

Therefore, an encoded version of the example dataset is:

index    my_category    enc_category_0    enc_category_1
0        A              1                 0
1        B              0                 1
2        C              0                 0
3        A              1                 0

As for its usefulness, as you said, it reduces the dimensionality. Moreover, I suppose it helps to avoid ending up with too many zeros in the encoded columns, as happens with OHE. Here is an interesting post on the subject: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
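
A quick way to reproduce the 'A'/'B'/'C' example is to run the encoder on a tiny frame. The exact output column names and which number each label gets assigned can vary with the library version, so treat this as an illustrative sketch rather than guaranteed output:

import pandas as pd
from category_encoders import BinaryEncoder

# hypothetical toy column; 3 distinct labels -> roughly ceil(log2(3)) = 2 bit columns
df = pd.DataFrame({'my_category': ['A', 'B', 'C', 'A']})
enc = BinaryEncoder(cols=['my_category'])
print(enc.fit_transform(df))
# expect bit columns such as my_category_0 / my_category_1, with the two 'A'
# rows getting identical bit patterns, as in the table above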

How does taking the log2 determine how many digits are needed to represent the data? If you understood the working principle above, you understand the use of the log2: computing the log2 of a number (and rounding up) gives the number of binary digits needed to encode that number. Example: ceil(log2(10)) = ceil(3.32) = 4, so 4 digits are needed to encode 10 in binary.
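
A quick numeric check of that relationship (not part of the original answer): ceil(log2(n)) is the smallest number of digits d such that 2**d >= n.

import math

for n in (2, 3, 5, 10, 17):
    d = math.ceil(math.log2(n))
    # prints n, d, and checks that d digits suffice while d - 1 digits do not
    print(n, d, 2 ** d >= n, 2 ** (d - 1) >= n)
# e.g. n=10 -> d=4: 2**4 = 16 >= 10, while 2**3 = 8 < 10, so 4 digits are needed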

More information on the implementation and a code example: http://contrib.scikit-learn.org/categorical-encoding/_modules/category_encoders/binary.html#BinaryEncoder

Hope this is clear,

Regarding "binary-data - How does this binary encoder work?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47063221/
