python - Trained "Decision Tree" vs. "Decision Path"

Reposted. Author: 太空宇宙. Updated: 2023-11-04 00:06:01


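I am using the scikit "Decision Tree" classifier to predict the "effort" size of a migration project. Another part of my requirement is to find the features that influence the prediction.

I trained the model and got a hierarchical tree with all the features placed at different nodes.

I assumed the same tree would be used to predict the size when I supplied a test record. To my surprise, that is not the case!

After predicting, I printed the decision_path to see the "features considered in that prediction".

This decision path is completely different from the tree the model built.

If the tree is not used to make predictions, what is the tree for?

How can I use the decision path to get the important features for that prediction?

If I export these rule sets and use them to find the decision path, I get the wrong features, or at least features that do not match the output of decision_path.

Edit 1

Added generic code. It gives similar output.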

from __future__ import print_function
import pandas as pd
import numpy as np

from sklearn import preprocessing
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import tree
# Create tree object 
import graphviz
import pydotplus
import collections

file_path = "sample_data_generic.csv"
data = pd.read_csv( file_path )
data.head()

df = data.copy()
cols = df.columns
col_len = len(cols)
features_category = []

for col_index in range( col_len ):
    if df[ cols[col_index] ].dtype == 'object' or df[ cols[col_index] ].dtype == 'float64':
        df[ cols[col_index] ] = df[ cols[col_index] ].astype('category')
        features_category.append( cols[col_index] )

#redefining the variable value as it is throwing some error in the below lines due to the presence of next line char?!
features_category = ['Cloud Provider', 'OS Upgrade Path', 'Target_OS_NAME', 'Target_OS_VERSION', 'os_version']

# create dataframe for target variable
df_target = df['Size']
df.drop('Size', axis=1, inplace=True)

df = pd.get_dummies(df, columns=features_category, dtype='int')

df.head()

df_x_data = df.copy()
df_x_data.head()
y_data = df_target
target_classes = y_data.unique()
target_classes = target_classes.astype('category')
test_size_val = 0.3

x_train, x_test, y_train, y_test = train_test_split(df_x_data, y_data, test_size=test_size_val, random_state=1)


print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])

x_train.sort_values(['Comps'], ascending=[True]) #, 'Estimation'
model = tree.DecisionTreeClassifier()
model = model.fit(x_train, y_train)
model.score(x_test, y_test)
dot_data = tree.export_graphviz(model, out_file=None, 
                     feature_names=x_train.columns,  
                     # class_names must follow the sorted order in model.classes_
                     class_names=[str(c) for c in model.classes_],  
                     filled=True, rounded=True,  
                     special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data)
print('graph: ', graph)
colors = ('white','red', 'green')

edges = collections.defaultdict(list)

for edge in graph.get_edge_list():
    edges[edge.get_source()].append(int(edge.get_destination()))
print( edges )
for edge in edges:
    edges[edge].sort()
    for i in range(2):
        dest = graph.get_node(str(edges[edge][i]))[0]
        dest.set_fillcolor(colors[i])
        
graph.write_png('decision_tree_2019_generic.png')

from IPython.display import Image
Image(filename = 'decision_tree_2019_generic.png')

to_predict = x_test[3:4]
model.predict( to_predict )

to_predict.values

applied = model.apply( to_predict )
applied

to_predict

decision_path = model.decision_path( to_predict )
print( decision_path.indices, '\n' )
print( decision_path[:1][:1])

predict_cols = decision_path.indices

predicted_row = to_predict
cols = predicted_row.columns
#print("len of cols: ", len(cols) )
for col in predict_cols:
    print( cols[col], predicted_row[ cols[col] ].values )

Sample data (generated for now):

Cloud Provider,Comps,env,hosts,OS Upgrade Path,Target_OS_NAME,Target_OS_VERSION,Size,os_version
AWS,11,2,3833,Not Direct,Linux,4,M,2
Google Cloud,16,6,4779,Direct,Mac,3,S,1
AWS,18,6,6677,Not Direct,Linux,7,S,8
Google Cloud,34,2,1650,Direct,Windows,5,B,1
AWS,35,6,9569,Direct,Windows,6,M,3
AWS,36,6,7421,Not Direct,Windows,3,B,5
Google Cloud,49,4,3469,Direct,Mac,6,B,1
AWS,54,5,5677,Direct,Mac,4,M,8

But the decision path for the predicted test record is: Comps [206] --> env [3] --> hosts [637]

Thanks in advance.

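Best answer

I think you have misunderstood what decision_path returns: it returns a sparse matrix, expressed in the node indices of the tree's internal representation, that indicates which nodes of the tree the prediction passed through. These indices are not meant to be (and in fact are not) aligned with the columns of your dataset. Instead, if you want to know which features are associated with the nodes the prediction passed through, try: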

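predict_nodes = decision_path.indices  # node ids visited by the prediction
predicted_row = to_predict
cols = predicted_row.columns
for node in predict_nodes:
    col = model.tree_.feature[node]  # column index tested at this node
    if col < 0:  # leaf nodes test no feature and hold a negative placeholder
        continue
    print( cols[col], predicted_row[ cols[col] ].values )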

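Note that leaf nodes do not test any feature and, in my experience, return a negative value for the feature index, so be sure to handle that case.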
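To make the node-versus-column distinction concrete, here is a minimal self-contained sketch. It assumes the iris dataset as a stand-in for the asker's CSV, and features_on_path is a hypothetical helper name, not part of scikit-learn. It maps each visited node id to the column it tests and drops the leaf, whose feature entry is negative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for the asker's migration data.
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names

model = DecisionTreeClassifier(random_state=0).fit(X, y)

sample = X[100:101]  # slice keeps the 2-D shape expected by the model
node_ids = model.decision_path(sample).indices  # node ids, not column ids


def features_on_path(model, node_ids):
    """Hypothetical helper: column index tested at each internal node."""
    feature = model.tree_.feature
    # Leaves store a negative placeholder, so they are filtered out.
    return [int(feature[n]) for n in node_ids if feature[n] >= 0]


used = features_on_path(model, node_ids)
for col in used:
    print(feature_names[col], sample[0, col])
```

The last entry of node_ids is the leaf that model.apply(sample) reports, which is why the list of features is one shorter than the list of nodes.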

To learn more about the tree's internal structure, see this example, and (as the documentation suggests) use help(sklearn.tree._tree.Tree)
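As a rough illustration of what that internal structure holds, the sketch below reads the tree_ arrays feature, threshold, children_left and children_right to spell out the test taken at each node for one row. The iris data and the explain_path helper are assumptions made for the example; only the tree_ attributes, apply and decision_path come from scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris is used only so the example runs on its own.
iris = load_iris()
X, y = iris.data, iris.target
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)


def explain_path(clf, row, feature_names):
    """Hypothetical helper: describe the split taken at each internal node."""
    tree = clf.tree_
    row_2d = row.reshape(1, -1)
    leaf_id = clf.apply(row_2d)[0]
    steps = []
    for node in clf.decision_path(row_2d).indices:
        if node == leaf_id:
            continue  # the leaf holds the prediction, not a test
        col = tree.feature[node]
        threshold = tree.threshold[node]
        sign = "<=" if row[col] <= threshold else ">"
        steps.append(
            f"node {node}: {feature_names[col]} = {row[col]} {sign} {threshold:.3f}"
        )
    return steps, leaf_id


steps, leaf_id = explain_path(clf, X[100], iris.feature_names)
for step in steps:
    print(step)
print("leaf", leaf_id, "predicts", iris.target_names[clf.predict(X[100:101])[0]])
```

Each printed line corresponds to one box in the exported graph, so the path can be checked against the plotted tree node by node.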

Regarding python - Trained "Decision Tree" vs. "Decision Path", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/54058350/
