
python - Is it possible to specify handle_unknown = 'ignore' for some columns and 'error' for others in OneHotEncoder?

Reposted. Author: 太空狗. Updated: 2023-10-30 02:06:57

I have a DataFrame whose columns are all categorical, and I am encoding it with OneHotEncoder from sklearn.preprocessing. My code is as follows:

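from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode every column, then fit a linear regression.
steps = [('OneHotEncoder', OneHotEncoder(handle_unknown='ignore')),
         ('LReg', LinearRegression())]

pipeline = Pipeline(steps)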

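As can be seen in OneHotEncoder, the handle_unknown parameter accepts either 'error' or 'ignore'. Is there a way to selectively ignore unknown categories for some columns while raising an error for others?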
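For reference, here is a minimal sketch of how the two settings differ on a single column. The toy values (including the unseen category 'FRA') are made up for illustration and are not part of the data used in this question.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training column; the values are illustrative only.
train = pd.DataFrame({"Country": ["USA", "IND", "UK"]})
unseen = pd.DataFrame({"Country": ["FRA"]})  # category absent from training

# handle_unknown='ignore': the unseen category is encoded as an all-zero row.
enc_ignore = OneHotEncoder(handle_unknown="ignore").fit(train)
ignored_row = enc_ignore.transform(unseen).toarray()
print(ignored_row)

# handle_unknown='error' (the default): transform raises ValueError instead.
enc_error = OneHotEncoder(handle_unknown="error").fit(train)
try:
    enc_error.transform(unseen)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

With 'ignore', the model downstream still receives a row, but one that carries no information about that column; with 'error', the bad input is rejected before it reaches the model.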

import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'USA', 'IND', 'UK', 'UK', 'UK'],
                   'Fruits': ['Apple', 'Strawberry', 'Mango', 'Berries', 'Banana', 'Grape'],
                   'Flower': ['Rose', 'Lily', 'Orchid', 'Petunia', 'Lotus', 'Dandelion'],
                   'Result': [1, 2, 3, 4, 5, 6]})

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

steps = [('OneHotEncoder', OneHotEncoder(handle_unknown ='ignore')) ,('LReg', LinearRegression())]

pipeline = Pipeline(steps)

from sklearn.model_selection import train_test_split

X = df[["Country","Flower","Fruits"]]
Y = df["Result"]
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.3, random_state=30, shuffle =True)

print("X_train.shape:", X_train.shape)
print("y_train.shape:", y_train.shape)
print("X_test.shape:", X_test.shape)
print("y_test.shape:", y_test.shape)

pipeline.fit(X_train,y_train)

y_pred = pipeline.predict(X_test)

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

#Mean Squared Error:
MSE = mean_squared_error(y_test,y_pred)

print("MSE", MSE)

#Root Mean Squared Error:
from math import sqrt

RMSE = sqrt(MSE)
print("RMSE", RMSE)

#R-squared score:
R2_score = r2_score(y_test,y_pred)

print("R2_score", R2_score)

In this case, if a new value appears in any of the Country, Fruits, or Flower columns, the model is still able to predict an output.

Is there a way to ignore unknown categories for Fruits and Flower, but raise an error for unknown values in the Country column?

Best answer

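I think ColumnTransformer() will solve your problem. You can specify the list of columns to which a OneHotEncoder with handle_unknown='ignore' should be applied, and do the same for handle_unknown='error'.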

Using ColumnTransformer, convert your pipeline to the following:

from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([
    ("ohe_ignore", OneHotEncoder(handle_unknown='ignore'), ["Flower", "Fruits"]),
    ("ohe_raise_error", OneHotEncoder(handle_unknown='error'), ["Country"]),
])

steps = [('OneHotEncoder', ct),
('LReg', LinearRegression())]

pipeline = Pipeline(steps)

Now, once the new pipeline has been fitted again (pipeline.fit(X_train, y_train)), we can predict:

>>> pipeline.predict(pd.DataFrame({'Country': ['UK'], 'Fruits': ['Apple'], 'Flower': ['Rose']}))

array([2.83333333])

>>> pipeline.predict(pd.DataFrame({'Country': ['UK'], 'Fruits': ['chk'], 'Flower': ['Rose']}))

array([3.66666667])


>>> pipeline.predict(pd.DataFrame({'Country': ['chk'], 'Fruits': ['Apple'], 'Flower': ['Rose']}))

ValueError: Found unknown categories ['chk'] in column 0 during transform

Note: ColumnTransformer is available from scikit-learn version 0.20 onward.
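To see what the model actually receives, the ColumnTransformer can also be inspected on its own, without the regression step. The following is a rough sketch under stated assumptions: the three-row training frame is a toy stand-in for the question's data, and safe_transform is a hypothetical helper introduced here for illustration, not part of scikit-learn.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy training data (illustrative; smaller than the frame in the question).
train = pd.DataFrame({
    "Country": ["USA", "IND", "UK"],
    "Fruits": ["Apple", "Mango", "Banana"],
    "Flower": ["Rose", "Lily", "Lotus"],
})

ct = ColumnTransformer([
    ("ohe_ignore", OneHotEncoder(handle_unknown="ignore"), ["Flower", "Fruits"]),
    ("ohe_raise_error", OneHotEncoder(handle_unknown="error"), ["Country"]),
])
ct.fit(train)

def safe_transform(row):
    """Hypothetical helper: encode one row, or return None if it is rejected."""
    try:
        out = ct.transform(pd.DataFrame([row]))
    except ValueError as exc:
        print("rejected:", exc)
        return None
    # The result may be sparse or dense depending on its density.
    return out.toarray() if hasattr(out, "toarray") else out

# Unknown fruit: accepted, and the Fruits block of the encoding is all zeros.
ok = safe_transform({"Country": "UK", "Fruits": "chk", "Flower": "Rose"})
print(ok)

# Unknown country: rejected with a ValueError.
bad = safe_transform({"Country": "chk", "Fruits": "Apple", "Flower": "Rose"})

In the accepted row only two entries are set, one for Flower and one for Country, which is why a prediction is still produced for an unseen fruit.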

Regarding "python - Is it possible to specify handle_unknown = 'ignore' for some columns and 'error' for others in OneHotEncoder?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56604811/
