Implementing Random Forest (RF) Regression in Python and Comparing Variable Importance

Reposted by: 我是一只小鸟 | Updated: 2023-02-16 14:31:15

This post shows how to implement Random Forest (RF) regression in Python and how to analyze and rank the importance of each independent variable.

For code and a worked example of the same process in MATLAB, see the post "Random Forest (RF) Regression and Variable Importance Analysis in MATLAB".

The post has two parts: the first walks through the code section by section, and the second gives the complete code.

1 Code walkthrough

1.1 Modules and data preparation

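First, import the required modules. Two of them, pydot and graphviz, are less commonly used; even with Anaconda I had to download and install them separately. If you also use Anaconda, see the post "Configuring the Python pydot and graphviz Libraries in an Anaconda Environment" for the installation steps.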

                        
import pydot
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn import metrics
from openpyxl import load_workbook
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestRegressor

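Next, we define the main variables that the code uses later. There is no need to dwell on them now; skim them and read on, and their meaning will become clear when each one is used.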

                        
train_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Train.csv'
test_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Test.csv'
write_excel_path='G:/CropYield/03_DL/05_NewML/ParameterResult_ML.xlsx'
tree_graph_dot_path='G:/CropYield/03_DL/05_NewML/tree.dot'
tree_graph_png_path='G:/CropYield/03_DL/05_NewML/tree.png'

random_seed=44
random_forest_seed=np.random.randint(low=1,high=230)

Next, we import the input data.

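Note that this post does not cover the following two preprocessing steps in detail. When I wrote it, I had already run a deep learning regression on the same data, so I simply reuse the inputs prepared for that work. For both steps, see the other posts listed below.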

Splitting the initial data into a training set and a test set.
One-hot encoding of categorical variables.

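Of these two steps, splitting the data into a training set and a test set is indispensable in both machine learning and deep learning; see Section 2.4 of "Python TensorFlow Deep Learning Regression Code: DNNRegressor" or Section 2.3 of "Python TensorFlow Deep Neural Network Regression: keras.Sequential". One-hot encoding of categorical variables is just as important for traditional machine learning methods such as random forest; see "One-hot Encoding of Categorical Variables in Python".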
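For readers who do not have the referenced posts at hand, the two preprocessing steps can be sketched roughly as follows. This is a minimal illustration, not the preprocessing actually used for the crop-yield data: the toy table, the categorical column "Soil", and the 25% test fraction are all assumptions made up for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table: one categorical column ("Soil") and the label ("Yield").
data = pd.DataFrame({
    "ID": range(1, 9),
    "Temp06": [18.2, 19.1, 17.5, 20.3, 18.8, 19.6, 17.9, 20.0],
    "Soil": ["clay", "sand", "loam", "clay", "sand", "loam", "clay", "sand"],
    "Yield": [5200, 4800, 6100, 5000, 4700, 6300, 5100, 4900],
})

# One-hot encode the categorical column before splitting, so that the
# training and test sets end up with identical columns.
encoded = pd.get_dummies(data, columns=["Soil"])

# Hold out 25% of the rows as the test set; random_state fixes the shuffle.
train_data, test_data = train_test_split(encoded, test_size=0.25, random_state=44)

print(list(encoded.columns))
print(len(train_data), len(test_data))
```

Encoding before splitting is a convenience choice here: it guarantees that a category which happens to appear only in one subset still gets a column in both.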

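As mentioned above, this post directly imports data stored in .csv files that has already been split into training and test sets and whose categorical variables have already been one-hot encoded. The first row of my data is the header, i.e. the name of each column. For a detailed explanation of the .csv import code, see the data import section of "Drawing Pairwise Joint Distribution Plots of Multiple Variables in Python".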

                        
# Data import

'''
column_name=['EVI0610','EVI0626','EVI0712','EVI0728','EVI0813','EVI0829','EVI0914','EVI0930','EVI1016',
             'Lrad06','Lrad07','Lrad08','Lrad09','Lrad10',
             'Prec06','Prec07','Prec08','Prec09','Prec10',
             'Pres06','Pres07','Pres08','Pres09','Pres10',
             'SIF161','SIF177','SIF193','SIF209','SIF225','SIF241','SIF257','SIF273','SIF289',
             'Shum06','Shum07','Shum08','Shum09','Shum10',
             'Srad06','Srad07','Srad08','Srad09','Srad10',
             'Temp06','Temp07','Temp08','Temp09','Temp10',
             'Wind06','Wind07','Wind08','Wind09','Wind10',
             'Yield']
'''
train_data=pd.read_csv(train_data_path,header=0)
test_data=pd.read_csv(test_data_path,header=0)

1.2 Separating features and labels

Features and labels are simply the independent and dependent variables. We separate them for both the training set and the test set.

                        
# Separate independent and dependent variables

train_Y=np.array(train_data['Yield'])
train_X=train_data.drop(['ID','Yield'],axis=1)
train_X_column_name=list(train_X.columns)
train_X=np.array(train_X)

test_Y=np.array(test_data['Yield'])
test_X=test_data.drop(['ID','Yield'],axis=1)
test_X=np.array(test_X)

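As you can see, drop removes the label 'Yield' from the original data. It also removes 'ID', the sample number from the initial data, which is of no further use and is therefore dropped together with the label. The variable train_X_column_name stores the header of each feature column (that is, the feature names) for later use.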

1.3 Building, training, and predicting with the RF model

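Next, we build the random forest model, train it, and use the test set for prediction. Note that the key random forest hyperparameters (for example n_estimators below) have to be tuned by repeated trials to find good values. For hyperparameter tuning in MATLAB, see Section 1.1 of "Random Forest (RF) Regression and Variable Importance Analysis in MATLAB"; the Python approach will be covered in the next post.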

                        
# Build RF regression model

random_forest_model=RandomForestRegressor(n_estimators=200,random_state=random_forest_seed)
random_forest_model.fit(train_X,train_Y)

# Predict test set data

random_forest_predict=random_forest_model.predict(test_X)
random_forest_error=random_forest_predict-test_Y

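Here, RandomForestRegressor builds the model. n_estimators is the number of trees. random_state is the random seed: it controls the Bootstrap sampling of the Bagging strategy (drawing training samples at random with replacement) used to build each tree and, when max_features is smaller than the number of features, the random choice of candidate features at each split. fit trains the model, predict produces the predictions, and the last line computes the prediction error.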
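The role of random_state, and one inexpensive way to compare candidate values of n_estimators, can be sketched as follows. This is only an illustration under stated assumptions: the data come from make_regression rather than the crop-yield files, and the out-of-bag (OOB) score is a technique I am substituting here for illustration, not the tuning method of the follow-up post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for train_X / train_Y (not the crop-yield data).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Same data + same random_state -> identical forests and predictions.
model_a = RandomForestRegressor(n_estimators=50, random_state=44).fit(X, y)
model_b = RandomForestRegressor(n_estimators=50, random_state=44).fit(X, y)
same_predictions = np.allclose(model_a.predict(X), model_b.predict(X))
print("Identical predictions:", same_predictions)

# Compare a few candidate values of n_estimators by out-of-bag R^2.
# Each sample is scored only by the trees that did not see it during training.
oob_scores = {}
for n_trees in (50, 100, 200):
    model = RandomForestRegressor(n_estimators=n_trees, oob_score=True, random_state=44)
    model.fit(X, y)
    oob_scores[n_trees] = model.oob_score_
    print(n_trees, round(model.oob_score_, 3))
```

Because the walkthrough draws random_forest_seed at random on every run, recording it (as the code does later in the Excel file) is what makes a particular run repeatable.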

1.4 Plotting predictions, and computing and saving accuracy metrics

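First, we plot the predictions: a fit plot of predicted versus true values and a histogram of the error distribution. For an explanation of this code, see Section 2.9 of "Python TensorFlow Deep Learning Regression Code: DNNRegressor".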

                        
# Draw test plot

plt.figure(1)
plt.clf()
ax=plt.axes(aspect='equal')
plt.scatter(test_Y,random_forest_predict)
plt.xlabel('True Values')
plt.ylabel('Predictions')
Lims=[0,10000]
plt.xlim(Lims)
plt.ylim(Lims)
plt.plot(Lims,Lims)
plt.grid(False)
    
plt.figure(2)
plt.clf()
plt.hist(random_forest_error,bins=30)
plt.xlabel('Prediction Error')
plt.ylabel('Count')
plt.grid(False)

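The code above produces two figures: the scatter plot of predicted versus true values with the 1:1 line, and the histogram of prediction errors.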

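Next, we compute and save the accuracy metrics. We use the Pearson correlation coefficient, the coefficient of determination, and the RMSE, and append the metrics of each model run to an Excel file. Again, see Section 2.9 of "Python TensorFlow Deep Learning Regression Code: DNNRegressor".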

                        
# Verify the accuracy

random_forest_pearson_r=stats.pearsonr(test_Y,random_forest_predict)
random_forest_R2=metrics.r2_score(test_Y,random_forest_predict)
random_forest_RMSE=metrics.mean_squared_error(test_Y,random_forest_predict)**0.5
print('Pearson correlation coefficient is {0}, and RMSE is {1}.'.format(random_forest_pearson_r[0],
                                                                        random_forest_RMSE))

# Save key parameters

excel_file=load_workbook(write_excel_path)
excel_all_sheet=excel_file.sheetnames
excel_write_sheet=excel_file[excel_all_sheet[0]]
excel_write_sheet=excel_file.active
max_row=excel_write_sheet.max_row
excel_write_content=[random_forest_pearson_r[0],random_forest_R2,random_forest_RMSE,random_seed,random_forest_seed]
# Write each metric into the first empty row, one column per value
for i,value in enumerate(excel_write_content):
    excel_write_sheet.cell(row=max_row+1,column=i+1).value=value
excel_file.save(write_excel_path)
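To make the three metrics concrete, here is a small sketch that computes them by hand with NumPy. The four true values and predictions are made up for illustration; the formulas are the standard definitions that stats.pearsonr, metrics.r2_score, and the square root of metrics.mean_squared_error are generally understood to implement.

```python
import numpy as np

# Hypothetical true values and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

residual = y_pred - y_true

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residual ** 2))

# R^2: 1 - (residual sum of squares) / (total sum of squares).
r2 = 1.0 - np.sum(residual ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Pearson r: covariance normalized by both standard deviations.
pearson_r = np.corrcoef(y_true, y_pred)[0, 1]

print(rmse, r2, pearson_r)  # 0.5, 0.95, about 0.984
```

Note that the coefficient of determination is not simply the square of the Pearson coefficient: here r squared is about 0.968 while R squared is 0.95, which is why the walkthrough records both.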

1.5 Visualizing a decision tree

In this part we use DOT, a graph description language, to draw one of the decision trees in the random forest.

                        
# Draw decision tree visualizing plot

random_forest_tree=random_forest_model.estimators_[5]
export_graphviz(random_forest_tree,out_file=tree_graph_dot_path,
                feature_names=train_X_column_name,rounded=True,precision=1)
(random_forest_graph,)=pydot.graph_from_dot_file(tree_graph_dot_path)
random_forest_graph.write_png(tree_graph_png_path)

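Here, estimators_[5] is the 6th tree of the random forest (indices start at 0). In other words, we pick one tree out of the many (their number is the n_estimators hyperparameter mentioned earlier) and draw it as an example.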

The resulting image shows that this single tree alone is already very large. Zooming in on its topmost part, the root node, reveals the information attached to each node.
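When only the top of such a large tree is of interest, a text dump limited to the first few levels can be easier to read than the full image. The following is a minimal sketch using sklearn.tree.export_text; the synthetic data and the feature names f0 to f3 are assumptions made for the example, and this is an alternative to, not a replacement for, the DOT-based drawing above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic stand-in data; the feature names are hypothetical.
X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

model = RandomForestRegressor(n_estimators=10, random_state=44).fit(X, y)

# Print only the top levels of the 6th tree; deeper branches are truncated.
tree_text = export_text(model.estimators_[5], feature_names=feature_names, max_depth=2)
print(tree_text)
```

export_graphviz also accepts a max_depth argument, so the same idea should carry over to the image if the picture is preferred.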

One remark: the root node of the tree shows samples=151, yet I have 315 samples in total. Why does this tree not contain all of the samples?

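This is the essence of random forest. The input data of each tree (the data in its root node) are selected at random, using the Bootstrap sampling of the Bagging strategy mentioned above. Because the sampling is with replacement, a bootstrap sample contains on average only about 63% of the distinct training samples, so the root node holds fewer distinct samples than the full data set. The results of all trees are then combined (this step is the Aggregation; the term Bagging is short for Bootstrap Aggregation) to form the final result of the random forest.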
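The effect of sampling with replacement can be checked with a quick simulation. This is a sketch under assumptions: the size 315 is the total sample count quoted above and is used only as an example size (the size of the training set itself is not stated), and the claim that the samples value in a scikit-learn tree node counts distinct samples reflects my understanding of the library rather than anything in the original post.

```python
import numpy as np

rng = np.random.default_rng(44)
n_samples = 315  # total sample count quoted in the text, used only as an example size

# One bootstrap sample: draw n_samples indices with replacement.
indices = rng.integers(0, n_samples, size=n_samples)
n_unique = np.unique(indices).size

# Average the fraction of distinct samples over many bootstrap samples.
fractions = [
    np.unique(rng.integers(0, n_samples, size=n_samples)).size / n_samples
    for _ in range(1000)
]
mean_fraction = float(np.mean(fractions))
print(n_unique, round(mean_fraction, 3))
```

The average fraction approaches 1 - 1/e, roughly 0.632, so each tree sees only about two thirds of the distinct training samples; the remainder are that tree's out-of-bag samples.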

1.6 Variable importance analysis

Here we analyze the importance of the variables and visualize it as a figure.

                        
# Calculate the importance of variables

random_forest_importance=list(random_forest_model.feature_importances_)
random_forest_feature_importance=[(feature,round(importance,8)) 
                                  for feature, importance in zip(train_X_column_name,random_forest_importance)]
random_forest_feature_importance=sorted(random_forest_feature_importance,key=lambda x:x[1],reverse=True)
plt.figure(3)
plt.clf()
importance_plot_x_values=list(range(len(random_forest_importance)))
plt.bar(importance_plot_x_values,random_forest_importance,orientation='vertical')
plt.xticks(importance_plot_x_values,train_X_column_name,rotation='vertical')
plt.xlabel('Variable')
plt.ylabel('Importance')
plt.title('Variable Importances')

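The result is a bar chart of the importances. Because I have a large number of features (independent variables), roughly 150 or more, the x-axis labels (the variable names) overlap. With the smaller number of variables that is more typical, this will not be a problem.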
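When there are too many variables for the labels to stay readable, one option is to plot only the most important ones, sorted. The sketch below does this on synthetic data; the feature names var00 to var39 and the cutoff top_k=10 are assumptions chosen for illustration, not part of the original analysis.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in an interactive session
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with hypothetical feature names.
X, y = make_regression(n_samples=300, n_features=40, n_informative=5, random_state=0)
names = [f"var{i:02d}" for i in range(X.shape[1])]

model = RandomForestRegressor(n_estimators=100, random_state=44).fit(X, y)
importances = model.feature_importances_

# Indices of the top_k most important features, largest first.
top_k = 10
order = np.argsort(importances)[::-1][:top_k]
top_names = [names[i] for i in order]
top_values = importances[order]

# Horizontal bars keep the labels readable; reverse so the largest is on top.
plt.figure()
plt.barh(top_names[::-1], top_values[::-1])
plt.xlabel("Importance")
plt.title(f"Top {top_k} variable importances")
plt.tight_layout()
```

In the walkthrough code, the already sorted list random_forest_feature_importance could serve the same purpose by slicing its first few entries.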

That concludes the section-by-section walkthrough of the code.

2 Complete code

                        
# -*- coding: utf-8 -*-
"""
Created on Sun Mar 21 22:05:37 2021

@author: fkxxgis
"""

import pydot
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn import metrics
from openpyxl import load_workbook
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestRegressor


# Attention! Data Partition
# Attention! One-Hot Encoding

train_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Train.csv'
test_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Test.csv'
write_excel_path='G:/CropYield/03_DL/05_NewML/ParameterResult_ML.xlsx'
tree_graph_dot_path='G:/CropYield/03_DL/05_NewML/tree.dot'
tree_graph_png_path='G:/CropYield/03_DL/05_NewML/tree.png'

random_seed=44
random_forest_seed=np.random.randint(low=1,high=230)

# Data import

'''
column_name=['EVI0610','EVI0626','EVI0712','EVI0728','EVI0813','EVI0829','EVI0914','EVI0930','EVI1016',
             'Lrad06','Lrad07','Lrad08','Lrad09','Lrad10',
             'Prec06','Prec07','Prec08','Prec09','Prec10',
             'Pres06','Pres07','Pres08','Pres09','Pres10',
             'SIF161','SIF177','SIF193','SIF209','SIF225','SIF241','SIF257','SIF273','SIF289',
             'Shum06','Shum07','Shum08','Shum09','Shum10',
             'Srad06','Srad07','Srad08','Srad09','Srad10',
             'Temp06','Temp07','Temp08','Temp09','Temp10',
             'Wind06','Wind07','Wind08','Wind09','Wind10',
             'Yield']
'''
train_data=pd.read_csv(train_data_path,header=0)
test_data=pd.read_csv(test_data_path,header=0)

# Separate independent and dependent variables

train_Y=np.array(train_data['Yield'])
train_X=train_data.drop(['ID','Yield'],axis=1)
train_X_column_name=list(train_X.columns)
train_X=np.array(train_X)

test_Y=np.array(test_data['Yield'])
test_X=test_data.drop(['ID','Yield'],axis=1)
test_X=np.array(test_X)

# Build RF regression model

random_forest_model=RandomForestRegressor(n_estimators=200,random_state=random_forest_seed)
random_forest_model.fit(train_X,train_Y)

# Predict test set data

random_forest_predict=random_forest_model.predict(test_X)
random_forest_error=random_forest_predict-test_Y

# Draw test plot

plt.figure(1)
plt.clf()
ax=plt.axes(aspect='equal')
plt.scatter(test_Y,random_forest_predict)
plt.xlabel('True Values')
plt.ylabel('Predictions')
Lims=[0,10000]
plt.xlim(Lims)
plt.ylim(Lims)
plt.plot(Lims,Lims)
plt.grid(False)
    
plt.figure(2)
plt.clf()
plt.hist(random_forest_error,bins=30)
plt.xlabel('Prediction Error')
plt.ylabel('Count')
plt.grid(False)

# Verify the accuracy

random_forest_pearson_r=stats.pearsonr(test_Y,random_forest_predict)
random_forest_R2=metrics.r2_score(test_Y,random_forest_predict)
random_forest_RMSE=metrics.mean_squared_error(test_Y,random_forest_predict)**0.5
print('Pearson correlation coefficient is {0}, and RMSE is {1}.'.format(random_forest_pearson_r[0],
                                                                        random_forest_RMSE))

# Save key parameters

excel_file=load_workbook(write_excel_path)
excel_all_sheet=excel_file.sheetnames
excel_write_sheet=excel_file[excel_all_sheet[0]]
excel_write_sheet=excel_file.active
max_row=excel_write_sheet.max_row
excel_write_content=[random_forest_pearson_r[0],random_forest_R2,random_forest_RMSE,random_seed,random_forest_seed]
# Write each metric into the first empty row, one column per value
for i,value in enumerate(excel_write_content):
    excel_write_sheet.cell(row=max_row+1,column=i+1).value=value
excel_file.save(write_excel_path)

# Draw decision tree visualizing plot

random_forest_tree=random_forest_model.estimators_[5]
export_graphviz(random_forest_tree,out_file=tree_graph_dot_path,
                feature_names=train_X_column_name,rounded=True,precision=1)
(random_forest_graph,)=pydot.graph_from_dot_file(tree_graph_dot_path)
random_forest_graph.write_png(tree_graph_png_path)

# Calculate the importance of variables

random_forest_importance=list(random_forest_model.feature_importances_)
random_forest_feature_importance=[(feature,round(importance,8)) 
                                  for feature, importance in zip(train_X_column_name,random_forest_importance)]
random_forest_feature_importance=sorted(random_forest_feature_importance,key=lambda x:x[1],reverse=True)
plt.figure(3)
plt.clf()
importance_plot_x_values=list(range(len(random_forest_importance)))
plt.bar(importance_plot_x_values,random_forest_importance,orientation='vertical')
plt.xticks(importance_plot_x_values,train_X_column_name,rotation='vertical')
plt.xlabel('Variable')
plt.ylabel('Importance')
plt.title('Variable Importances')

And with that, we are done.
