
python - How to standardize only the numeric variables in an sklearn pipeline?

Reposted. Author: 太空狗. Updated: 2023-10-29 19:36:22

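I am trying to create an sklearn pipeline with 2 steps:

  1. Standardize the data
  2. Fit the data using KNN

However, my data contains both numeric and categorical variables, and I have converted the categorical ones to dummy variables using pd.get_dummies. I want to standardize the numeric variables but leave the dummy variables as they are. I have been doing it like this: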

import pandas as pd
from sklearn.preprocessing import StandardScaler

# X: DataFrame containing both numeric and categorical (dummy) columns
# numeric: list of numeric column names
# categorical: list of categorical (dummy) column names
scaler = StandardScaler()
# Reuse X's index so the index-based merge below lines up row by row.
X_numeric_std = pd.DataFrame(data=scaler.fit_transform(X[numeric]), columns=numeric, index=X.index)
X_std = pd.merge(X_numeric_std, X[categorical], left_index=True, right_index=True)

However, if I create a pipeline like this:

pipe = sklearn.pipeline.make_pipeline(StandardScaler(), KNeighborsClassifier())

it will standardize all the columns in my DataFrame. Is there a way to do this while standardizing only the numeric columns?

Best answer

UPD: 2021-05-10

For sklearn >= 0.20, we can use sklearn.compose.ColumnTransformer.

Here is a small example:

Imports and data loading

# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause

import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Load data from https://www.openml.org/d/40945
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

Pipeline-aware data preprocessing with ColumnTransformer:

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
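
If listing the column names by hand is impractical, the columns can also be picked by dtype. The following is only a minimal sketch, assuming sklearn >= 0.22 (where make_column_selector is available); the toy DataFrame and the names toy and selector_preprocessor are hypothetical and not part of the original answer:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (not from the original answer).
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "male"],
})

# Select columns by dtype instead of listing them by name:
# numeric columns are scaled, everything else is one-hot encoded.
selector_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), make_column_selector(dtype_include=np.number)),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude=np.number)),
])

toy_transformed = selector_preprocessor.fit_transform(toy)
print(toy_transformed.shape)
```

The selectors are evaluated when the transformer is fitted, so the same object can be reused on frames with a different set of columns.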

Classification

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
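
The same idea can be applied to the exact setup from the question, where the dummy columns were already created with pd.get_dummies and the estimator is KNN. Below is a rough sketch under those assumptions; the data, the column names, and the variable names (raw, labels, X_demo, scale_numeric) are made up for illustration. The remainder="passthrough" option of ColumnTransformer keeps the columns that are not listed instead of dropping them:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two numeric columns plus one categorical column.
raw = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0, 200.0],
    "weight": [50.0, 60.0, 65.0, 80.0, 90.0, 100.0],
    "color": ["red", "blue", "red", "blue", "red", "blue"],
})
labels = np.array([0, 0, 0, 1, 1, 1])

# Dummy-encode up front, as described in the question.
X_demo = pd.get_dummies(raw, columns=["color"])
numeric = ["height", "weight"]

# Scale only the numeric columns; remainder="passthrough" forwards the
# dummy columns unchanged (the default, "drop", would discard them).
scale_numeric = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric)],
    remainder="passthrough",
)

pipe = make_pipeline(scale_numeric, KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_demo, labels)
print(pipe.predict(X_demo))
```

Because the scaler lives inside the pipeline, it is refitted on the training portion of every split during cross-validation, which the manual merge approach from the question does not do.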

Old answer:

Suppose you have the following DF:

In [163]: df
Out[163]:
     a     b    c    d
0  aaa  1.01  xxx  111
1  bbb  2.02  yyy  222
2  ccc  3.03  zzz  333

In [164]: df.dtypes
Out[164]:
a     object
b    float64
c     object
d      int64
dtype: object

You can find all the numeric columns:

In [165]: num_cols = df.columns[df.dtypes.apply(lambda c: np.issubdtype(c, np.number))]

In [166]: num_cols
Out[166]: Index(['b', 'd'], dtype='object')

In [167]: df[num_cols]
Out[167]:
      b    d
0  1.01  111
1  2.02  222
2  3.03  333
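
As an alternative to the lambda over df.dtypes, pandas also offers DataFrame.select_dtypes. A small sketch, rebuilding the example frame under the hypothetical names df_demo and num_cols_alt:

```python
import pandas as pd

# Rebuild the small example frame used in this answer.
df_demo = pd.DataFrame({
    "a": ["aaa", "bbb", "ccc"],
    "b": [1.01, 2.02, 3.03],
    "c": ["xxx", "yyy", "zzz"],
    "d": [111, 222, 333],
})

# select_dtypes returns the numeric sub-frame; .columns gives the names.
num_cols_alt = df_demo.select_dtypes(include="number").columns
print(list(num_cols_alt))
```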

and apply StandardScaler only to those numeric columns:

In [168]: scaler = StandardScaler()

In [169]: df[num_cols] = scaler.fit_transform(df[num_cols])

In [170]: df
Out[170]:
     a         b    c         d
0  aaa -1.224745  xxx -1.224745
1  bbb  0.000000  yyy  0.000000
2  ccc  1.224745  zzz  1.224745

Now you can one-hot encode the categorical (non-numeric) columns...
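
The answer stops at this point, so what follows is only one possible way to finish the step: a sketch using pd.get_dummies, as in the question. The frame df_cat mirrors the scaled example, with the scaled values typed in by hand, and the names df_cat, cat_cols, and df_encoded are hypothetical:

```python
import pandas as pd

# Stand-in for the scaled frame from the example (values typed in by hand).
df_cat = pd.DataFrame({
    "a": ["aaa", "bbb", "ccc"],
    "b": [-1.224745, 0.0, 1.224745],
    "c": ["xxx", "yyy", "zzz"],
    "d": [-1.224745, 0.0, 1.224745],
})

# Treat every non-numeric column as categorical.
cat_cols = df_cat.select_dtypes(exclude="number").columns

# One indicator column is created per category; numeric columns are untouched.
df_encoded = pd.get_dummies(df_cat, columns=list(cat_cols))
print(df_encoded.columns.tolist())
```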

About "python - How to standardize only the numeric variables in an sklearn pipeline?": we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48673402/
