
python - OneHotEncoder raises a NaN error after calling SimpleImputer

Reposted. Author: 行者123. Updated: 2023-12-04 15:39:32

I'm having trouble understanding how pipelines work in sklearn. Here is an example using the Titanic dataset.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv('datasets/train.csv')

cat_attribs = ["Embarked", "Cabin", "Ticket", "Name"]

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
])

str_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="most_frequent")),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, ["Pclass", "Age", "SibSp", "Parch", "Fare"]),
    ("str", str_pipeline, ["Cabin", "Sex"]),
    ("cat", OneHotEncoder(), ["Cabin"]),
])

full_pipeline.fit_transform(data)

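I expected this to fill in all missing (NaN) values for both the numeric and string attributes, and then finally convert the Cabin attribute to numbers.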

Instead, the code ends with the following error:

ValueError: Input contains NaN

If I remove the line that calls OneHotEncoder and print the transformed array, it contains no NaN values.

So my question is: how should I call OneHotEncoder in this case?

Best answer

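I suggest applying OneHotEncoder to all categorical variables, as a separate pipeline that runs after the imputer. ColumnTransformer applies each of its transformers to the original input columns independently rather than chaining them, so the "cat" OneHotEncoder in the question receives the raw Cabin column with its NaN values still present.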

Since the numeric columns need only a single step, you can pass the SimpleImputer directly to ColumnTransformer.

Try this:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Impute missing categories first, then one-hot encode the imputed result.
cat_preprocess = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder())

# ColumnTransformer takes a list of (name, transformer, columns) tuples.
ct = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Pclass", "Age", "SibSp", "Parch", "Fare"]),
    ("str", cat_preprocess, ["Cabin", "Sex"]),
])

pipeline = Pipeline([('preprocess', ct)])
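
To check the idea without the Titanic CSV, here is a minimal sketch on a small made-up DataFrame. The `toy` data and its values are hypothetical stand-ins, not rows from the real dataset, and `handle_unknown="ignore"` is an optional extra that the answer above does not use. It only illustrates that chaining the imputer and the encoder in one pipeline leaves no NaN in the output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with missing values in every column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 53.10],
    "Cabin": pd.Series(["C85", np.nan, "C85", "E46"], dtype=object),
    "Sex": pd.Series(["male", "female", "female", np.nan], dtype=object),
})

# Categorical columns: impute first, then encode the imputed values.
cat_preprocess = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),  # optional: tolerate unseen categories
)

ct = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Age", "Fare"]),
    ("cat", cat_preprocess, ["Cabin", "Sex"]),
])

out = ct.fit_transform(toy)

# ColumnTransformer may return a sparse matrix; densify for inspection.
dense = out.toarray() if hasattr(out, "toarray") else np.asarray(out)

print(dense.shape)
print(np.isnan(dense).any())
```

The output columns follow the order of the transformers: the imputed numeric columns come first, followed by the one-hot columns for Cabin and Sex.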

Regarding "python - OneHotEncoder raises a NaN error after calling SimpleImputer", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58372334/
