
python - How do I combine LabelBinarizer and OneHotEncoder in a pipeline in Python to handle categorical variables?

Reposted. Author: 行者123. Updated: 2023-11-30 08:58:46

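I have spent the past few days looking through tutorials and Q&As on Stack Overflow without finding the right guidance, mainly because examples that show how to use LabelBinarizer or OneHotEncoder don't show how to incorporate them into a pipeline, and vice versa.

I have a dataset with 4 variables: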

num1    num2    cate1    cate2
3       4       Cat      1
9       23      Dog      0
10      5       Dog      1

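num1 and num2 are numeric variables; cate1 and cate2 are categorical variables. I know I need to encode the categorical variables somehow before fitting an ML algorithm, but after several attempts I am still not sure how to do that inside a pipeline.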
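As a minimal sketch of the setup described above, the three rows from the table can be loaded into a pandas DataFrame. It also shows why the categorical columns are picked by hand in the code that follows: cate2 is stored as 0/1, so dtype-based detection alone would count it as numeric. The names `numeric_by_dtype`, `cat_cols` and `num_cols` are illustrative and do not come from the original post.

```python
import pandas as pd

# The three rows shown in the question's table.
df = pd.DataFrame({
    "num1": [3, 9, 10],
    "num2": [4, 23, 5],
    "cate1": ["Cat", "Dog", "Dog"],
    "cate2": [1, 0, 1],
})

# dtype-based detection treats cate2 as numeric because it is stored as 0/1.
numeric_by_dtype = list(df.select_dtypes(include="number").columns)

# Hand-picked categorical columns; the numeric columns are whatever remains.
cat_cols = ["cate1", "cate2"]
num_cols = [c for c in numeric_by_dtype if c not in cat_cols]

print(numeric_by_dtype)
print(num_cols)
```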

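import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer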

# Transformer that selects the named columns from a DataFrame
class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)

X_selected = X.filter(['num1', 'num2', 'cate1', 'cate2'])

# From the selected X, pick the categorical columns by hand,
# since some categorical variables are coded as 0/1 and look numeric
X_selected_cat = X_selected.filter(['cate1', 'cate2'])
X_cat_cols = list(X_selected_cat.columns)  # categorical column names, hand-selected

# Find the numeric columns by dtype, excluding the categorical columns
X_num_cols = [c for c in X_selected.select_dtypes(include=np.number).columns
              if c not in X_cat_cols]  # numeric column names, automated here

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y,
                                                    test_size=0.5,
                                                    random_state=567,
                                                    stratify=y)

# Pipeline
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=X_num_cols), StandardScaler())),
        ('categorical', make_pipeline(Columns(names=X_cat_cols)))
    ])),
    ('LR_model', LogisticRegression()),
])

This gives me the error ValueError: could not convert string to float: 'Cat'

Replacing the fourth-to-last line with this

('categorical', make_pipeline(Columns(names=X_cat_cols),OneHotEncoder()))

gives me the same ValueError: could not convert string to float: 'Cat'

Replacing the fourth-to-last line with this
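For context, OneHotEncoder only accepted integer-coded input before scikit-learn 0.20, which would explain why it fails on 'Cat' here. A minimal sketch, assuming scikit-learn 0.20 or later is installed, where the encoder handles string columns directly (the `cats` frame simply reuses the categorical values from the table in the question):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Categorical columns from the question's table.
cats = pd.DataFrame({
    "cate1": ["Cat", "Dog", "Dog"],
    "cate2": [1, 0, 1],
})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; convert it for display.
encoded = encoder.fit_transform(cats).toarray()

print(encoder.categories_)
print(encoded)
```

Each input column expands into one output column per category, so the two columns here become four.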

('categorical', make_pipeline(Columns(names=X_cat_cols),LabelBinarizer(),OneHotEncoder()))
])),

gives me a different error, TypeError: fit_transform() takes 2 positional arguments but 3 were given

Replacing the fourth-to-last line with this

('numeric', make_pipeline(Columns(names=X_num_cols),LabelBinarizer())),

gives me this error, TypeError: fit_transform() takes 2 positional arguments but 3 were given

Best answer
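The TypeError most likely comes from the fact that LabelBinarizer is designed for target labels: its fit_transform accepts a single argument y, whereas a Pipeline calls fit_transform(X, y) on each intermediate step. The sketch below reproduces the mismatch outside of a pipeline; the `labels` list and the second argument are illustrative values only.

```python
from sklearn.preprocessing import LabelBinarizer

labels = ["Cat", "Dog", "Dog"]

# Intended use: binarize a single vector of labels.
binarized = LabelBinarizer().fit_transform(labels)

# What a Pipeline step effectively receives: both X and y, passed positionally.
try:
    LabelBinarizer().fit_transform(labels, [1, 0, 1])
    raised_type_error = False
except TypeError:
    raised_type_error = True

print(binarized.tolist())
print(raised_type_error)
```

This is why adding LabelBinarizer to either branch of the FeatureUnion fails in the same way, regardless of whether the selected columns are numeric or categorical.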

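Following Marcus's suggestion, I tried to install the scikit-learn dev version but could not. I did find something similar, though, called category_encoders.

Changing the code to this works: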

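import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer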
import category_encoders as CateEncoder

# Transformer that selects the named columns from a DataFrame
class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)

X_selected = X.filter(['num1', 'num2', 'cate1', 'cate2'])

# From the selected X, pick the categorical columns by hand,
# since some categorical variables are coded as 0/1 and look numeric
X_selected_cat = X_selected.filter(['cate1', 'cate2'])
X_cat_cols = list(X_selected_cat.columns)  # categorical column names, hand-selected

# Find the numeric columns by dtype, excluding the categorical columns
X_num_cols = [c for c in X_selected.select_dtypes(include=np.number).columns
              if c not in X_cat_cols]  # numeric column names, automated here

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y,
                                                    test_size=0.5,
                                                    random_state=567,
                                                    stratify=y)

# Pipeline
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=X_num_cols), StandardScaler())),
        ('categorical', make_pipeline(Columns(names=X_cat_cols), CateEncoder.BinaryEncoder()))
    ])),
    ('LR_model', LogisticRegression()),
])
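As an alternative to category_encoders, scikit-learn 0.20 and later ship ColumnTransformer, which routes columns to different transformers without a custom selector class. The following is a rough sketch of that approach, not the method used in the answer above. The toy rows and the MED target values are made up purely for illustration, and the step names mirror the ones in the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data in the question's layout; rows and MED values are invented for this sketch.
df = pd.DataFrame({
    "num1": [3, 9, 10, 4, 8, 11],
    "num2": [4, 23, 5, 6, 20, 7],
    "cate1": ["Cat", "Dog", "Dog", "Cat", "Dog", "Cat"],
    "cate2": [1, 0, 1, 1, 0, 0],
    "MED": [0, 1, 1, 0, 1, 0],
})
X = df.drop("MED", axis=1)
y = df["MED"]

# Scale the numeric columns and one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["num1", "num2"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["cate1", "cate2"]),
])

pipe = Pipeline([
    ("features", preprocess),
    ("LR_model", LogisticRegression()),
])

pipe.fit(X, y)
predictions = pipe.predict(X)
print(predictions)
```

Listing the column names explicitly keeps cate2 out of the numeric branch even though it is stored as 0/1.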

Regarding "python - How do I combine LabelBinarizer and OneHotEncoder in a pipeline in Python to handle categorical variables?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49018652/
