python - Missing categorical data should be encoded as an all-zero one-hot vector

Reposted by: 行者123. Updated: 2023-12-05 03:29:09

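I am working on a machine learning project with very sparsely labeled data. There are several categorical features, which add up to roughly one hundred distinct categories across all features.

For example: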

0    red
1    blue
2    <missing>

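import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

color_cat = pd.DataFrame(['red', 'blue', np.nan])
color_enc = OneHotEncoder(sparse_output=True, handle_unknown='ignore')  # `sparse=True` before scikit-learn 1.2
color_one_hot = color_enc.fit_transform(color_cat)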

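After passing these through scikit-learn's OneHotEncoder, I expected the missing value to be encoded as 00, because the documentation states that handle_unknown='ignore' makes the encoder return an all-zero array. Replacing the missing values with some other value via SimpleImputer is not an option for me.

What I expected: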

0    10
1    01
2    00

Instead, OneHotEncoder treats the missing value as just another category.

What I got:

0    100
1    010
2    001

I have seen the related question "How to handle missing values (NaN) in categorical data when using scikit-learn OneHotEncoder?", but the solutions there do not work for me. I explicitly need a zero vector.
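The behavior described in the question can be reproduced with a minimal sketch, assuming scikit-learn 0.24 or later (where OneHotEncoder accepts NaN in its input). A NaN that is present at fit time appears to be learned as a category of its own, while handle_unknown='ignore' only affects values first encountered at transform time. The value 'green' is a made-up unseen category used purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

color_cat = pd.DataFrame(['red', 'blue', np.nan])

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(color_cat)

# NaN was present during fit, so it is stored as a third learned category.
learned = enc.categories_[0]
n_learned = len(learned)
nan_is_category = bool(pd.isna(learned).any())

# 'green' (hypothetical) was not seen during fit, so 'ignore'
# encodes it as an all-zero row instead of raising an error.
unseen = enc.transform(pd.DataFrame(['green'])).toarray()
print(n_learned, nan_is_category, unseen)
```

In other words, the all-zero encoding is reserved for categories the encoder has never seen, which is why the NaN row ends up with its own column.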

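Best answer

I have never really worked with sparse matrices, but one approach is to drop the column that corresponds to your nan value. Take categories_ from your fitted encoder, build a boolean mask that is True wherever the category is not nan (I use pd.Series.notna, but there may be other ways), and use it to create a new sparse matrix (or reassign the existing one). Basically, add this to your code: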

# currently you have
color_one_hot
# <3x3 sparse matrix of type '<class 'numpy.float64'>'
#   with 3 stored elements in Compressed Sparse Row format>

# line of code to add
new_color_one_hot = color_one_hot[:,pd.Series(color_enc.categories_[0]).notna().to_numpy()]

# and now you have
new_color_one_hot
# <3x2 sparse matrix of type '<class 'numpy.float64'>'
#   with 2 stored elements in Compressed Sparse Row format>

# and
new_color_one_hot.todense()
# matrix([[0., 1.],
#         [1., 0.],
#         [0., 0.]])
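Since the question mentions several categorical features, the same masking idea can be extended to more than one column. The following is only a sketch under stated assumptions: the helper name drop_nan_columns and the two-column frame are hypothetical, and it assumes the encoder uses its default settings (no drop option, no infrequent-category grouping), so that the output columns line up one to one with the entries of categories_.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def drop_nan_columns(encoder, one_hot):
    """Remove the one-hot columns that stand for a learned NaN category."""
    # categories_ holds one array per input feature, in output-column order.
    keep = np.concatenate([~pd.isna(cats) for cats in encoder.categories_])
    return one_hot[:, keep]

# Hypothetical two-feature frame with a missing value in each column.
df = pd.DataFrame({'color': ['red', 'blue', np.nan],
                   'size': ['S', np.nan, 'L']})

enc = OneHotEncoder(handle_unknown='ignore')
one_hot = enc.fit_transform(df)            # 3x6: each feature has a NaN column
trimmed = drop_nan_columns(enc, one_hot)   # 3x4: NaN columns removed
dense = trimmed.toarray()
print(dense)
```

The remaining columns are ordered blue, red, L, S, and a row with a missing value has zeros across that feature's columns.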

Edit: get_dummies gives a similar result: pd.get_dummies(color_cat[0], sparse=True)
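A short sketch of the get_dummies variant, shown here with dense float output so the result is easy to inspect (the answer's sparse=True would produce sparse-backed columns instead). It relies on dummy_na being False by default, so NaN gets no indicator column.

```python
import numpy as np
import pandas as pd

color_cat = pd.DataFrame(['red', 'blue', np.nan])

# dummy_na defaults to False, so NaN gets no column and its row stays all zero.
dummies = pd.get_dummies(color_cat[0], dtype=float)
print(dummies)
```

One caveat: get_dummies does not remember the categories it saw, so the resulting columns depend on whatever data is passed in each call. That may matter when new data has to be encoded consistently with the training data.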

Edit: after a closer look, you can specify the categories parameter of OneHotEncoder, so if you do this:

color_cat = pd.DataFrame(['red', 'blue', np.nan])
# pass only the non-missing categories; `sparse_output` was named `sparse` before scikit-learn 1.2
color_enc = OneHotEncoder(categories=[color_cat[0].dropna().unique()],
                          sparse_output=True, handle_unknown='ignore')
color_one_hot = color_enc.fit_transform(color_cat)
color_one_hot.todense()
# matrix([[1., 0.],
#         [0., 1.],
#         [0., 0.]])
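For several categorical columns, the categories argument takes one list per column. The sketch below assumes a hypothetical two-column frame and builds those lists from the non-missing values of each column. It is an illustration of the idea, not a definitive recipe.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical two-feature frame with a missing value in each column.
df = pd.DataFrame({'color': ['red', 'blue', np.nan],
                   'size': ['S', np.nan, 'L']})

# One list of non-missing categories per column, in order of first appearance.
categories = [df[col].dropna().unique().tolist() for col in df.columns]

# NaN is not among the listed categories, so 'ignore' encodes it as zeros.
enc = OneHotEncoder(categories=categories, handle_unknown='ignore')
dense = enc.fit_transform(df).toarray()
print(dense)
```

Because NaN is never registered as a category here, the zero rows come straight out of the encoder, and the same fitted encoder should also map missing values in later data to zeros without any column slicing.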

Regarding "python - Missing categorical data should be encoded as an all-zero one-hot vector", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/71054166/