
python - Random Forest is overfitting

Reposted. Author: 太空狗. Updated: 2023-10-30 00:31:38

I am using scikit-learn with stratified CV to compare some classifiers. I am computing: accuracy, recall, AUC.

For parameter optimization I used GridSearchCV with 5-fold CV.

RandomForestClassifier(warm_start= True, min_samples_leaf= 1, n_estimators= 800, min_samples_split= 5,max_features= 'log2', max_depth= 400, class_weight=None)

are the best_params from GridSearchCV.

My problem: I think I am really overfitting. For example:

Random Forest with standard deviation (+/-)

  • precision: 0.99 (+/- 0.06)
  • sensitivity: 0.94 (+/- 0.06)
  • specificity: 0.94 (+/- 0.06)
  • B_accuracy: 0.94 (+/- 0.06)
  • AUC: 0.94 (+/- 0.11)

Logistic Regression with standard deviation (+/-)

  • precision: 0.88 (+/- 0.06)
  • sensitivity: 0.79 (+/- 0.06)
  • specificity: 0.68 (+/- 0.06)
  • B_accuracy: 0.73 (+/- 0.06)
  • AUC: 0.73 (+/- 0.041)

The other classifiers give results similar to logistic regression (so they do not look overfitted).

My code for the CV is:

import numpy as np

# Split each record of `data` into its features (index 0) and label (index 1).
X, y = [], []
for row in data:
    X.append(row[0])
    y.append(float(row[1]))
x = np.array(X)
y = np.array(y)

import math

def SD(values):
    # Return (population standard deviation, mean) of a list of numbers.
    mean = sum(values) / len(values)
    squared_diffs = [(v - mean) ** 2 for v in values]
    variance = sum(squared_diffs) / len(values)
    return math.sqrt(variance), mean

for name, clf in zip(titles, classifiers):
    # go through all classifiers, compute 10 folds
    pre, sen, spe, ba, area = [], [], [], [], []
    # skf is defined elsewhere (not shown); it is assumed to yield
    # (train_index, test_index) pairs of index arrays.
    for train_index, test_index in skf:
        # Select the training and test rows for this fold.
        X_train, X_test = x[train_index], x[test_index]
        y_train, y_test = y[train_index], y[test_index]

        #clf=clf.fit(X_train,y_train)
        #predicted=clf.predict_proba(X_test)
        #... other code, calculating metrics and so on...
    print(name)
    # Each metric is reported with its own standard deviation
    # (the original reused SD(pre)[0] for four of the five lines).
    print("precision: %0.2f \t(+/- %0.2f)" % (SD(pre)[1], SD(pre)[0]))
    print("sensitivity: %0.2f \t(+/- %0.2f)" % (SD(sen)[1], SD(sen)[0]))
    print("specificity: %0.2f \t(+/- %0.2f)" % (SD(spe)[1], SD(spe)[0]))
    print("B_accuracy: %0.2f \t(+/- %0.2f)" % (SD(ba)[1], SD(ba)[0]))
    print("AUC: %0.2f \t(+/- %0.2f)" % (SD(area)[1], SD(area)[0]))
    print("\n")
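The metric calculation is elided in the loop above, so the following is only a minimal sketch of one way to obtain sensitivity, specificity and balanced accuracy per fold from a confusion matrix. The synthetic data from make_classification, the LogisticRegression stand-in and the names X_demo, y_demo, skf_demo and base_clf are assumptions for illustration, not the asker's data or code. It uses the current StratifiedKFold(...).split(X, y) API and clones the estimator so that every fold starts from an unfitted model.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption; not the data set from the question).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
base_clf = LogisticRegression(max_iter=1000)

sen, spe, ba = [], [], []
for train_idx, test_idx in skf_demo.split(X_demo, y_demo):
    # clone() returns an unfitted copy, so nothing carries over between folds.
    fold_clf = clone(base_clf)
    fold_clf.fit(X_demo[train_idx], y_demo[train_idx])
    y_pred = fold_clf.predict(X_demo[test_idx])

    # For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_demo[test_idx], y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    sen.append(sensitivity)
    spe.append(specificity)
    ba.append((sensitivity + specificity) / 2)

for label, values in [("sensitivity", sen), ("specificity", spe), ("B_accuracy", ba)]:
    # np.std defaults to the population standard deviation, like SD() above.
    print("%s: %0.2f (+/- %0.2f)" % (label, np.mean(values), np.std(values)))
```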

If I use scores = cross_validation.cross_val_score(clf, X, y, cv=10, scoring='accuracy') instead, I do not get these "overfitted" values. So maybe something is wrong with the CV method I am using? But it only happens for RF...
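One possible contributor, offered as an assumption because the fitting code is commented out in the loop above: the best_params include warm_start=True, and the manual loop appears to reuse the same clf object for every fold, whereas cross_val_score clones the estimator before each fit. With warm_start=True and an unchanged n_estimators, a second call to fit on a random forest keeps the trees that are already there instead of training new ones, so trees trained on an earlier fold could be scored on rows they have already seen. The sketch below illustrates that behaviour on synthetic data; X_demo, y_demo and the two halves are hypothetical names.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption; not the data set from the question).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
first_half, second_half = np.arange(0, 100), np.arange(100, 200)

rf = RandomForestClassifier(n_estimators=20, warm_start=True, random_state=0)
rf.fit(X_demo[first_half], y_demo[first_half])
trees_after_first_fit = list(rf.estimators_)

# Refit on different rows without increasing n_estimators.
# scikit-learn warns that no new trees are fitted; the warning is silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    rf.fit(X_demo[second_half], y_demo[second_half])

# Check whether the forest still holds the very same tree objects.
reused = all(a is b for a, b in zip(trees_after_first_fit, rf.estimators_))
print("trees reused after second fit:", reused)
```

If this applies, setting warm_start=False or calling sklearn.base.clone(clf) at the start of each fold would avoid the carry-over.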

I wrote my own loop because the cross_val function lacks a specificity scoring function.
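In current scikit-learn versions a hand-written loop is not strictly needed for this. As a hedged sketch: specificity is the recall of the negative class, so it can be expressed with make_scorer(recall_score, pos_label=0) and passed to cross_validate together with the built-in scorers. The synthetic data, the reduced n_estimators=50 and the name scoring are assumptions for illustration, and the labels are assumed to be 0/1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (assumption; not the data set from the question).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

scoring = {
    "precision": "precision",
    "sensitivity": "recall",
    # Specificity = recall of the negative class (label 0 is assumed negative).
    "specificity": make_scorer(recall_score, pos_label=0),
    "B_accuracy": "balanced_accuracy",
    "AUC": "roc_auc",
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0)

# cross_validate clones the estimator for every fold before fitting it.
results = cross_validate(forest, X_demo, y_demo, cv=cv, scoring=scoring)

for name in scoring:
    fold_scores = results["test_" + name]
    print("%s: %0.2f (+/- %0.2f)" % (name, np.mean(fold_scores), np.std(fold_scores)))
```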

Best answer

Herbert,

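If your goal is to compare different learning algorithms, I recommend using nested cross-validation. (By learning algorithms I mean different algorithms such as logistic regression, decision trees, and other discriminative models that learn a hypothesis or model (the final classifier) from your training data.)

"Regular" cross-validation is fine if you want to tune the hyperparameters of a single algorithm. However, as soon as you run the hyperparameter optimization with the same cross-validation parameters/folds, your performance estimate is likely to be over-optimistic. If you run cross-validation over and over again, your test data becomes "training data" to some extent.

People actually ask me this question quite often, so I will post some excerpts from an FAQ section I published here: http://sebastianraschka.com/faq/docs/evaluate-a-model.html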

In nested cross-validation, we have an outer k-fold cross-validation loop to split the data into training and test folds, and an inner loop is used to select the model via k-fold cross-validation on the training fold. After model selection, the test fold is then used to evaluate the model performance. After we have identified our "favorite" algorithm, we can follow up with a "regular" k-fold cross-validation approach (on the complete training set) to find its "optimal" hyperparameters and evaluate it on the independent test set. Let's consider a logistic regression model to make this clearer: using nested cross-validation you will train m different logistic regression models, one for each of the m outer folds, and the inner folds are used to optimize the hyperparameters of each model (e.g., using grid search in combination with k-fold cross-validation). If your model is stable, these m models should all have the same hyperparameter values, and you report the average performance of this model based on the outer test folds. Then, you proceed with the next algorithm, e.g., an SVM etc.
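The procedure described in the excerpt can be written out as an explicit loop, which also makes the stability check visible (do the m outer folds select the same hyperparameters?). This is only a sketch under stated assumptions: the synthetic data, the small grid over C, and the names outer_cv, inner_cv, outer_scores and chosen_params are illustrative choices, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption; not the data set from the question).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid

outer_scores, chosen_params = [], []
for train_idx, test_idx in outer_cv.split(X_demo, y_demo):
    # Inner loop: select hyperparameters using the outer training fold only.
    gs = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="accuracy", cv=inner_cv)
    gs.fit(X_demo[train_idx], y_demo[train_idx])
    chosen_params.append(gs.best_params_)

    # Outer loop: score the refitted best model on the untouched test fold.
    outer_scores.append(gs.score(X_demo[test_idx], y_demo[test_idx]))

for fold, (params, score) in enumerate(zip(chosen_params, outer_scores), start=1):
    print("outer fold %d: best params %s, accuracy %.3f" % (fold, params, score))
print("nested CV accuracy: %.3f +/- %.3f" % (np.mean(outer_scores), np.std(outer_scores)))
```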


I can only strongly recommend this excellent paper, which discusses the issue in more detail:

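PS: Typically, you don't need/want to tune the hyperparameters of a random forest (so extensively). The idea behind random forests (a form of bagging) is actually not to prune the decision trees; in fact, one reason Breiman came up with the random forest algorithm was to deal with the pruning issue/overfitting of individual decision trees. So the only parameter you really have to "worry" about is the number of trees (and maybe the number of random features considered at each split). Typically, however, you are best off taking bootstrap samples of size n (where n is the number of samples in the training set) and sqrt(m) random features (where m is the dimensionality of your training set).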
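As a rough sketch of that advice in scikit-learn terms: bootstrap=True draws a bootstrap sample of the training set for every tree, and max_features="sqrt" considers about sqrt(m) features at each split, which leaves n_estimators as the main setting. The synthetic data and the oob_score=True check are additions for illustration and are not part of the original answer; the out-of-bag estimate is just one convenient way to look at performance without a separate CV loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption; not the data set from the question).
X_demo, y_demo = make_classification(n_samples=300, n_features=16, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # the main parameter to think about
    max_features="sqrt",   # about sqrt(m) random features per split
    bootstrap=True,        # bootstrap sample of the training set for each tree
    oob_score=True,        # score each row using only trees that did not see it
    random_state=0,
)
forest.fit(X_demo, y_demo)

print("number of trees: %d" % len(forest.estimators_))
print("out-of-bag accuracy: %.3f" % forest.oob_score_)
```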

Hope this helps!

EDIT:

Some example code for nested CV via scikit-learn:

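import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X_train and y_train are the training data, defined elsewhere (not shown).

# Scale the features, then fit an SVM.
pipe_svc = Pipeline([('scl', StandardScaler()),
                     ('clf', SVC(random_state=1))])

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

param_grid = [{'clf__C': param_range,
               'clf__kernel': ['linear']},
              {'clf__C': param_range,
               'clf__gamma': param_range,
               'clf__kernel': ['rbf']}]


# Nested Cross-validation (here: 5 x 2 cross validation)
# =====================================
# Inner loop: 5-fold grid search; outer loop: 2-fold performance estimate.
gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=5)
scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=2)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))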

Regarding python - Random Forest is overfitting, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33948946/
