python - Passing categorical data to Sklearn Decision Tree

Reposted. Author: IT老高. Updated: 2023-10-28 20:33:28

There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says the following:

Some advantages of decision trees are:

(...)

Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See the algorithms for more information.

But running the following script

import pandas as pd 
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']

tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])

outputs the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b

I know that in R it is possible to pass categorical data directly. Is this possible with Sklearn?

Best answer

(This is just a reformatted version of my comment above from 2016... it still holds true.)

The accepted answer to this question is misleading.

As it stands, sklearn decision trees do not handle categorical data - see issue #5442.

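The commonly recommended approach of label encoding converts the categories to integers, which DecisionTreeClassifier() will then treat as numeric. If your categorical data is not ordinal, this is not good - you will end up with splits that do not make sense.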
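To make the ordering problem concrete, here is a minimal sketch. The `colors` feature is a hypothetical example that is not part of the original question; it only shows how label encoding imposes an order that the tree will then split on.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal feature with no natural order (illustrative values only).
colors = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# LabelEncoder numbers the categories in sorted order of their labels.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'blue': 0, 'green': 1, 'red': 2}
print(codes.tolist())  # [2, 1, 0, 1]

# A tree split is a single threshold on this column (e.g. x <= 0.5 or x <= 1.5).
# One split can separate "blue" from the rest, or "red" from the rest, but it can
# never isolate "green", purely because of the alphabetical ordering.
```

The integer codes carry no meaning beyond alphabetical position, so any split the tree finds on them reflects the spelling of the labels rather than a property of the data.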

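Using a OneHotEncoder is the only currently valid way. It allows arbitrary splits that do not depend on the label ordering, but it is computationally expensive.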
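As a rough sketch of the one-hot approach applied to the data from the question: the use of `ColumnTransformer` and `make_pipeline` here is an assumption about how one might wire it up (it requires a reasonably recent scikit-learn, newer than the Python 2.7 install in the traceback), not something the original answer prescribes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question.
data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})
features = data[["A", "B", "C"]]

# One-hot encode the string columns; pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["A", "B"])],
    remainder="passthrough",
)

# random_state is fixed only to make the sketch reproducible.
model = make_pipeline(preprocess, DecisionTreeClassifier(random_state=0))
model.fit(features, data["Class"])

predictions = model.predict(features)
print(predictions.tolist())
```

Each category becomes its own 0/1 column, so the tree can single out any one category with a single split. The cost is one extra column per category, which is where the computational expense comes from when a feature has many distinct values.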

This question (python - Passing categorical data to Sklearn Decision Tree) corresponds to a similar question on Stack Overflow: https://stackoverflow.com/questions/38108832/
