
machine-learning - Cross validation + decision trees in sklearn


I am trying to build a decision tree with cross validation using sklearn and pandas.

My question is about the code below: the split produces the data I then use for both training and testing, and I try to find the best depth of the tree by recreating it n times with different max depth settings. When using cross validation, should I use k-fold CV, and if so, how would I use it in my code?

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn import cross_validation

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

df = pd.read_csv('magic04.data',header=None,names=features)

df['class'] = df['class'].map({'g':0,'h':1})

x = df[features[:-1]]
y = df['class']

x_train,x_test,y_train,y_test = cross_validation.train_test_split(x,y,test_size=0.4,random_state=0)

depth = []
for i in range(3,20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    clf = clf.fit(x_train, y_train)
    depth.append((i, clf.score(x_test, y_test)))
print(depth)

Here is a link to the data I am using, in case it helps anyone: https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope

Best Answer

In your code you are creating a static train/test split. If you want to select the best depth by cross validation, you can use sklearn.cross_validation.cross_val_score inside the for loop.

You can read sklearn's documentation for more details.

Here is your code, updated to use CV:

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.cross_validation import cross_val_score
from pprint import pprint

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

df = pd.read_csv('magic04.data',header=None,names=features)
df['class'] = df['class'].map({'g':0,'h':1})

x = df[features[:-1]]
y = df['class']

# x_train,x_test,y_train,y_test = cross_validation.train_test_split(x,y,test_size=0.4,random_state=0)
depth = []
for i in range(3,20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    # Perform 7-fold cross validation
    scores = cross_val_score(estimator=clf, X=x, y=y, cv=7, n_jobs=4)
    depth.append((i, scores.mean()))
print(depth)
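
Since the question asks about k-fold CV specifically: passing an integer cv to cross_val_score with a classifier already uses stratified folds, but you can also pass a cross-validator object for explicit control over the splitting. The following is a minimal sketch, not part of the original answer, assuming a recent sklearn where these classes live in sklearn.model_selection and reusing x and y from above; it also shows one way to pull the best depth out of the resulting list:

from sklearn import tree
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Explicit 7-fold stratified splitter (shuffle and random_state are assumptions, for reproducibility)
cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)

depth = []
for i in range(3, 20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    scores = cross_val_score(estimator=clf, X=x, y=y, cv=cv, n_jobs=4)
    depth.append((i, scores.mean()))

# Pick the depth with the highest mean CV score
best_depth, best_score = max(depth, key=lambda t: t[1])
print(best_depth, best_score)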

Alternatively, you can use sklearn.grid_search.GridSearchCV instead of writing the for loop yourself, especially if you want to optimize over more than one hyperparameter.

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import GridSearchCV

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

df = pd.read_csv('magic04.data',header=None,names=features)
df['class'] = df['class'].map({'g':0,'h':1})

x = df[features[:-1]]
y = df['class']


parameters = {'max_depth':range(3,20)}
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, n_jobs=4)
clf.fit(X=x, y=y)
tree_model = clf.best_estimator_
print(clf.best_score_, clf.best_params_)
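
By default GridSearchCV refits the best estimator on all of the data it was given, so best_score_ above is a cross-validated score, not a test-set score. If you also want an estimate on data the search never saw, one option (a sketch, not part of the original answer) is to hold out a test set first, mirroring the 40% split from the question:

from sklearn import tree
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out 40% of the data before tuning, as in the question's original split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)

parameters = {'max_depth': range(3, 20)}
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, n_jobs=4)
clf.fit(X=x_train, y=y_train)

# Cross-validated score from the search vs. score of the refitted best tree on the held-out set
print(clf.best_params_, clf.best_score_, clf.score(x_test, y_test))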

Edit: Changed how GridSearchCV is imported to reflect learn2day's comment.

Regarding machine-learning - cross validation + decision trees in sklearn, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35097003/
