python - Why can't my Python RandomForestRegressor accurately predict its training data?

Reposted by: 行者123 · Last updated: 2023-11-28 19:17:16

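I am learning machine learning and want to use scikit-learn's RandomForestRegressor() on a fairly complicated dataset. To get the hang of it first, though, I am trying to work through a basic example, shown below: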

import sklearn.ensemble as se
import numpy as np
forest = se.RandomForestRegressor(n_estimators=1000)
traindata = np.arange(1000).reshape(200,5)
forest = forest.fit(traindata[0::,1::],traindata[0::,0])

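At this point, what I believe I have done is this: I created a 200-row matrix with 5 values per row, of the form [ x, x+1, x+2, x+3, x+4 ], where x is a multiple of 5 (for example [0,1,2,3,4], [5,6,7,8,9], and so on).

I have told my forest to fit the features [ x+1, x+2, x+3, x+4 ] to predict x. Here is what happens when I predict: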

forest.predict([1,2,3,4]) >> array([2.785])
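This seems really counter-intuitive to me. Given that the feature values [1,2,3,4] are in the training data with x = 0, shouldn't my forest be able to predict it more accurately than 2.785?
I went a step further and looked at the feature importances: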

forest.feature_importances_ >> array([0.26349716, 0.23664264, 0.23360533, 0.26625487])
To me, that does not suggest any significant bias in the way I am looking at this. What am I missing here?

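Best answer

Why are the predictions inaccurate?

Short version: because of the nature of the method proposed by the clever Breiman.

Longer version:

Random forests are very interesting learners.

However, you need a bit of patience to tune them.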

forest.set_params( oob_score    = True,    # set True to be able to read
                                           #     the oob-samples score
                   random_state = 2015     # set so as to keep retesting
                                           #     possible / meaningful on
                                           #     an otherwise randomised
                                           #     learner construction
                   )                       # takes effect on the next .fit()

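In principle, any call to the .fit() method does a lot of work behind the scenes to build a set of randomised decision trees, which together make up the RandomForest that works on your dataset.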
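
One concrete consequence of that randomised construction can be sketched as follows, assuming scikit-learn's default bootstrap=True. Each tree is fitted on a bootstrap resample of the rows, so any given training row is missing from roughly a third of the trees. Trees that never saw the row with x = 0 answer with a neighbouring target such as 5 or 10, and the averaged forest output ends up above 0. The variable names and n_estimators=200 are choices made for this illustration, not part of the original post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same toy data as in the question: rows [x, x+1, x+2, x+3, x+4], x = 0, 5, 10, ...
traindata = np.arange(1000).reshape(200, 5)
X, y = traindata[:, 1:], traindata[:, 0]
query = np.array([[1, 2, 3, 4]])  # current scikit-learn expects a 2-D array here

# Default behaviour: every tree is fitted on a bootstrap resample of the rows.
bagged = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Illustration only: switch resampling off so that every tree sees every row.
full = RandomForestRegressor(n_estimators=200, bootstrap=False, random_state=0).fit(X, y)

pred_bagged = bagged.predict(query)[0]
pred_full = full.predict(query)[0]
print("bootstrap=True :", pred_bagged)
print("bootstrap=False:", pred_full)
```

Switching bootstrap off is shown only to isolate the effect. It makes the forest reproduce its training targets, which says nothing about how it behaves on unseen data.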

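The "quality" of the .fit() is expressed in .oob_score_, which shows how (in)accurate the predictions on the oob-samples (the rows a given tree did not see during training, the truly clever part of Breiman's method) were once training of the given RandomForest had finished. This helps you estimate how "well" or "poorly" your trained RandomForest performs on the available dataset.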
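
A minimal sketch of reading the out-of-bag estimate, reusing the question's toy data. For a regressor, oob_score_ is an R² value, and oob_prediction_ holds the out-of-bag prediction for each training row. The choice of n_estimators=200 is an assumption made here to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

traindata = np.arange(1000).reshape(200, 5)
X, y = traindata[:, 1:], traindata[:, 0]

# oob_score must be enabled before fitting; random_state keeps reruns comparable.
forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=2015)
forest.fit(X, y)

oob_r2 = forest.oob_score_         # R^2 computed only from trees that did not see each row
oob_pred = forest.oob_prediction_  # one out-of-bag prediction per training row
print("OOB R^2:", oob_r2)
print("OOB prediction for the row with x = 0:", oob_pred[0])
```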

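More important, however, is (or should be) the learner's ability to generalise, that is, how well its predictive power matches reality once it is given an unseen example.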

This can be tested with the .score() method of a trained RandomForest instance.
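
As a sketch of such a check, the toy data can be split so that .score() is evaluated on rows the forest has never seen. The 75/25 split and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

traindata = np.arange(1000).reshape(200, 5)
X, y = traindata[:, 1:], traindata[:, 0]

# Hold back a quarter of the rows as unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

train_r2 = forest.score(X_train, y_train)  # R^2 on rows used for fitting
test_r2 = forest.score(X_test, y_test)     # R^2 on held-out rows
print("train R^2:", train_r2)
print("test  R^2:", test_r2)
```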

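RandomForest is an ensemble predictor: a classifier takes a "majority vote" among its trees, while a RandomForestRegressor averages their individual predictions. To get a feel for this, try to display the internal state of the army of random trees: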

def printLDF( aPopulationSET ):
    # print each distinct value with its count, its share and the cumulative share
    LDF_example, LDF_counts = np.unique( aPopulationSET, return_counts = True )
    GDF_sum_scaler          = float( LDF_counts.sum() )
    for i in range( LDF_example.shape[0] ):
        print( "{0: > 6d}: {1: > 6d} x {2: > 15.2f} {3: > 15.4f} % {4: > 15.1f} %".format( i,
                                                                                      LDF_counts[i],
                                                                                      LDF_example[i],
                                                                                      100 * LDF_counts[i]      / GDF_sum_scaler,
                                                                                      100 * LDF_counts[:i].sum() / GDF_sum_scaler
                                                                                      ) )
    return

# anExample must be a 2-D array holding one row, e.g. np.array( [[1,2,3,4]] )
>>> printLDF( [ aTree.predict( anExample )[0] for aTree in forest.estimators_ ] )

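This shows you the predictions of the individual trees, which are combined into the overall forest-based prediction.

Among other things, this means that a RandomForest will, in principle, never predict a value "outside" the range of values "visited" during training (by design, it cannot "extrapolate").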
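
Both points can be checked on the question's toy data with the sketch below. It collects the per-tree predictions, confirms that their mean matches forest.predict(), and then queries a feature vector far beyond the training range. The "far" query values are made up for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

traindata = np.arange(1000).reshape(200, 5)
X, y = traindata[:, 1:], traindata[:, 0]
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

query = np.array([[1, 2, 3, 4]], dtype=float)

# One prediction per tree; the regressor reports their mean.
per_tree = np.array([tree.predict(query)[0] for tree in forest.estimators_])
forest_pred = forest.predict(query)[0]
print("distinct tree outputs:", np.unique(per_tree))
print("mean of trees:", per_tree.mean(), "forest:", forest_pred)

# A query far outside the training range still yields a value within [min(y), max(y)].
far = np.array([[10001, 10002, 10003, 10004]], dtype=float)
far_pred = forest.predict(far)[0]
print("prediction far outside the training range:", far_pred, "max(y):", y.max())
```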
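
How can it be made better?

Well, feature engineering is the key. If you know that a RandomForest is a feasible learner for your case and you feel its observed predictive power is poor, feature selection is the first thing to blame.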
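
Inspecting the forest

Inspect the learner's internal state: check what the trees in the forest are doing.

You may gain deeper insight into the model in the following way: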

import numpy as np

def prediction_up_dn_intervals( aPredictorMODEL,                # >>> http://blog.datadive.net/prediction-intervals-for-random-forests/
                                X_,                             # aStateVECTOR: X_sampled ( a numpy array )
                                aPredictorOutputIDX = 0,        # (4,2,2) -> singleQUAD ( LONG.TP/SL, SHORT.TP/SL ) <-- idxMAP( 'LONG', 'TP', 1 )
                                aRequiredPercentile = 95
                                ):
    # NOTE: the [0,aPredictorOutputIDX] indexing assumes a multi-output predictor,
    #       whose trees return a 2-D array from .predict()
    err_dn = []
    err_up = []
    #---------------------------------------------------------------------------------
    if len( X_.shape ) == 1:                                    # for a single X_example run
        preds = []
        for pred in aPredictorMODEL.estimators_:
            preds.append( pred.predict( X_.reshape( 1, -1 ) )[0,aPredictorOutputIDX] )     # de-array-ification
        err_dn.append( np.percentile( preds,       ( 100 - aRequiredPercentile ) / 2. ) )
        err_up.append( np.percentile( preds, 100 - ( 100 - aRequiredPercentile ) / 2. ) )
    else:
        #-----------------------------------------------------------------------------
        for x in range( len( X_ ) ):                            # for a multi X_example run
            preds = []
            for pred in aPredictorMODEL.estimators_:
                preds.append( pred.predict( X_[x].reshape( 1, -1 ) )[0,aPredictorOutputIDX] )  # de-array-ification
            err_dn.append( np.percentile( preds,       ( 100 - aRequiredPercentile ) / 2. ) )
            err_up.append( np.percentile( preds, 100 - ( 100 - aRequiredPercentile ) / 2. ) )
    #---------------------------------------------------------------------------------
    return err_up, err_dn

#numba.jit( 'f8(<<OBJECT>>,f8[:,:],f8[:,:],i8,f8)' )            # <<OBJECT>> prevents JIT
def getPredictionsOnINTERVAL( aPredictorENGINE,                 # a MULTI-OBJECTIVE PREDICTOR -> a singleQUAD or a full 4-QUAD (16,0) <-(4,2,2)
                              X_,
                              y_GndTRUTH,                       # (4,2,2) -> (16,0) a MULTI-OBJECTIVE PREDICTOR
                              aPredictionIDX = 0,               # (4,2,2) -> singleQUAD ( LONG.TP/SL, SHORT.TP/SL ) <-- idxMAP( 'LONG', 'TP', 1 )
                              percentile     = 75
                              ):
    """
    |>>> getPredictionsOnINTERVAL( loc_PREDICTOR, X_sampled, y_sampled, idxMAP( "LONG", "TP", 1 ), 75 )  1.0                  +0:01:29.375000
    |>>> getPredictionsOnINTERVAL( loc_PREDICTOR, X_sampled, y_sampled, idxMAP( "LONG", "TP", 1 ), 55 )  0.9992532724237898   +0:03:59.922000
    |>>> getPredictionsOnINTERVAL( loc_PREDICTOR, X_sampled, y_sampled, idxMAP( "LONG", "TP", 1 ), 50 )  0.997100939998243    +0:09:16.328000
    |>>> getPredictionsOnINTERVAL( loc_PREDICTOR, X_sampled, y_sampled, idxMAP( "LONG", "TP", 1 ),  5 )  0.31375735746288325  +0:01:16.422000
    """
    correct_on_interval = 0                                     # faster to keep as INTEGER ... += 1 and only finally make DIV on FLOAT(s) in RET
    # truth = y_                                                # Y[idx[trainsize:]]
    err_up, err_dn = prediction_up_dn_intervals( aPredictorENGINE,   # ( rf,
                                                 X_,                 #   X[idx[trainsize:]],
                                                 aPredictionIDX,     #   idxMAP( "LONG", "TP", 1 ),
                                                 percentile          #   percentile = 90
                                                 )                   #   )
    #-------------------------------------------------------------------# for a single X_ run
    if ( len( X_.shape ) == 1 ):
        if ( err_dn[0] <= y_GndTRUTH[aPredictionIDX] <= err_up[0] ):
            return 1.
        else:
            return 0.
    #-------------------------------------------------------------------# for a multi X_ run
    for i, val in enumerate( y_GndTRUTH[:,aPredictionIDX] ):            # enumerate( truth )
        if err_dn[i] <= val <= err_up[i]:
            correct_on_interval += 1
    #-------------------------------------------------------------------
    return correct_on_interval / float( y_GndTRUTH.shape[0] )           # print correct / len( truth )

def mapPredictionsOnINTERVAL( aPredictorENGINE,
                              X_,
                              y_GndTRUTH,
                              aPredictionIDX   = 0,
                              aPercentilleSTEP = 5
                              ):
    """
     5%  0.313757
    10%  0.420847
    15%  0.510191
    20%  0.628481
    25%  0.719758
    30%  0.839058
    35%  0.909646
    40%  0.963454
    45%  0.986603
    50%  0.997101
    55%  0.999253
    60%  0.999912
    65%  1.000000 >>> RET/JIT
    70%  1.000000 xxxxxxxxxxxxxx
    75%  1.000000 xxxxxxxxxxxxxx

    ???? .fit( X_, y_[:,8:12] ) # .fit() on HORIZON-T0+3???? ... y_GndTRUTH.shape[1] v/s .predict().shape[1]
    """
    for aPercentille in range( aPercentilleSTEP, 100, aPercentilleSTEP ):
        Quotient = getPredictionsOnINTERVAL( aPredictorENGINE, X_, y_GndTRUTH, aPredictionIDX, aPercentille )
        print( "{0: > 3d}-percentil {1: > 6.3f} %".format( aPercentille, 100 * Quotient ) )
        if ( Quotient == 1 ):                                   # stop once every ground-truth value is covered
            return
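
The functions above index a two-dimensional tree output, which fits a multi-output predictor. For the single-output forest from the question, the same percentile idea can be sketched more compactly as below. The helper name tree_percentile_interval, the 75/25 split and the 95 % band are assumptions made for this illustration, not part of the original answer.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def tree_percentile_interval(forest, X, percentile=95):
    """Hypothetical helper: lower/upper bounds taken from the spread of per-tree predictions."""
    # Shape (n_trees, n_samples): one row of predictions per tree.
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    lower = np.percentile(per_tree, (100 - percentile) / 2.0, axis=0)
    upper = np.percentile(per_tree, 100 - (100 - percentile) / 2.0, axis=0)
    return lower, upper

traindata = np.arange(1000).reshape(200, 5)
X, y = traindata[:, 1:].astype(float), traindata[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
lower, upper = tree_percentile_interval(forest, X_test, percentile=95)

# Fraction of held-out targets that fall inside their tree-based band.
coverage = np.mean((lower <= y_test) & (y_test <= upper))
print("share of held-out targets inside the 95 % band:", coverage)
```

The band reflects disagreement among the trees only, so it is better read as a diagnostic of the forest than as a calibrated prediction interval.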

Regarding "python - Why can't my Python RandomForestRegressor accurately predict its training data?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32103301/
