
machine-learning - ML DecisionTreeClassifier - continuous features


How can I tell ml.DecisionTreeClassifier to score a feature as continuous rather than categorical, without resorting to the Bucketizer or QuantileDiscretizer approach?

Without binning the features (Bucketizer), most of the scoring set gets dropped rather than scored, because StringIndexer with handleInvalid="skip" discards any row containing a value that was not seen during training (Spark 2.1 does not support handleInvalid="keep").
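For reference, the binning workaround I am trying to avoid looks roughly like this (a minimal sketch; the split boundaries are hypothetical and would normally come from domain knowledge or a QuantileDiscretizer fit on the training data):

from pyspark.ml.feature import Bucketizer

# Hypothetical split points, for illustration only
income_bucketizer = Bucketizer(
    splits=[-float("inf"), 20000.0, 50000.0, 100000.0, float("inf")],
    inputCol="income",
    outputCol="income_bucket")

My current code, which passes the continuous features through StringIndexer/OneHotEncoder into ML's DecisionTreeClassifier, is below.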

from pyspark.mllib.linalg import Vectors
from pyspark.ml import Pipeline
from pyspark.sql import Row, SparkSession, SQLContext
from pyspark.sql.types import StringType, DoubleType
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import udf

# Load the training set (parquet format) into a DataFrame.
# This assumes an existing sqlContext (e.g., the one created by the pyspark shell).
train_df = sqlContext.read.parquet("/data/training_set")

# Convert the feature columns to double. withColumn returns a new
# DataFrame, so the result must be assigned back.
train_df = train_df.withColumn("income", train_df["income"].cast(DoubleType()))
train_df = train_df.withColumn("age", train_df["age"].cast(DoubleType()))

# StringIndexer - Target
# First we use StringIndexer to turn the target labels into numeric indices
indexer1 = StringIndexer(inputCol="target", outputCol="target_numeric", handleInvalid="skip")

############
# StringIndexer/OneHotEncoder - income
# First we use StringIndexer to get a numeric categorical feature
indexer2 = StringIndexer(inputCol="income", outputCol="income_numeric", handleInvalid='skip')

# Next we binarize the indexed categorical feature via OneHotEncoder
encoder2 = OneHotEncoder(inputCol="income_numeric", outputCol="income_vector")
############

############
# StringIndexer/OneHotEncoder - age
# First we use StringIndexer to get a numeric categorical feature
indexer3 = StringIndexer(inputCol="age", outputCol="age_numeric", handleInvalid='skip')

# Next we binarize the indexed categorical feature via OneHotEncoder
encoder3 = OneHotEncoder(inputCol="age_numeric", outputCol="age_vector")
############

# Names of the encoded feature columns to assemble into the feature vector
indexedcols = [
    "income_vector",
    "age_vector"
]

# FEATURES need to be in a Vector, which is why we use a VectorAssembler.
# The VectorAssembler takes the encoded columns as input and outputs the
# combined "features" vector. Provide the "indexedcols" list created above
# as the inputCols parameter, and name the outputCol "features".
va = VectorAssembler(inputCols=indexedcols, outputCol="features")

# Create a DecisionTreeClassifier, setting the label column to the
# indexed label column ("target_numeric") and the features column to the
# newly created column from the VectorAssembler above ("features").
# Store the StringIndexer transformers, the encoders, the VectorAssembler,
# and the DecisionTreeClassifier in a list called "steps".
clf = DecisionTreeClassifier(labelCol="target_numeric", impurity="gini", maxBins=32, maxMemoryInMB=1024)

# Create the transform steps for the ML pipeline
steps = [indexer1,
         indexer2, encoder2,
         indexer3, encoder3,
         va, clf]

# Create a ML pipeline named "pl" using the steps list to set the stages parameter
pl = Pipeline(stages=steps)

# Run the fit method of the pipeline on the DataFrame and store the fitted
# model in a new variable called "plmodel"
plmodel = pl.fit(train_df)

######################################################################################
# Scoring Set
######################################################################################

# Now get the data you want to run the model against
scoring_df = sqlContext.read.parquet("/data/scoring_set")

# Convert the feature columns to double, assigning the result back
scoring_df = scoring_df.withColumn("income", scoring_df["income"].cast(DoubleType()))
scoring_df = scoring_df.withColumn("age", scoring_df["age"].cast(DoubleType()))

# Run the transform method of the pipeline model created above on the
# "scoring_df" DataFrame to create a new DataFrame called "predictions".
#
# handleInvalid="skip" drops any rows whose values were not seen in the
# training set. Without it, errors such as "unseen label: 40" are raised,
# meaning the scoring set contains a value the training set never had.
predictions = plmodel.transform(scoring_df)

# UDFs that pull the individual class probabilities out of the
# "probability" vector column; declare the return type as double
vector_udf1 = udf(lambda vector: float(vector[1]), DoubleType())
vector_udf0 = udf(lambda vector: float(vector[0]), DoubleType())

# Save the scored DataFrame to HDFS
outputDF = predictions.select("age",
                              "income",
                              "prediction",
                              vector_udf0("probability").alias("probability0"),
                              vector_udf1("probability").alias("probability1"))
outputDF.write.format("parquet").mode("overwrite").save("/data/algo_scored")

Best answer

You do not need Bucketizer or QuantileDiscretizer for continuous features. Categorical features still go through StringIndexer and OneHotEncoder in the pipeline, but continuous features only need to be listed in the VectorAssembler; the DecisionTreeClassifier will handle them as continuous automatically (internally it evaluates up to maxBins candidate split thresholds per feature).

So the code would look like this:

from pyspark.mllib.linalg import Vectors
from pyspark.ml import Pipeline
from pyspark.sql import Row, SparkSession, SQLContext
from pyspark.sql.types import StringType, DoubleType
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import udf

# Load the training set that is in parquet format into a data frame
train_df = sqlContext.read.parquet("/data/training_set")

# Convert the feature columns to double, assigning the result back
train_df = train_df.withColumn("income", train_df["income"].cast(DoubleType()))
train_df = train_df.withColumn("age", train_df["age"].cast(DoubleType()))

# StringIndexer - Target
# First we use StringIndexer to turn the target labels into numeric indices
indexer1 = StringIndexer(inputCol="target", outputCol="target_numeric", handleInvalid="skip")

# The continuous feature columns go straight into the feature vector;
# no StringIndexer or OneHotEncoder is needed for them
indexedcols = [
    "income",
    "age"
]

# FEATURES need to be in a Vector, which is why we use a VectorAssembler.
# The VectorAssembler takes the raw continuous columns as input and outputs
# the combined "features" vector. Provide the "indexedcols" list created
# above as the inputCols parameter, and name the outputCol "features".
va = VectorAssembler(inputCols=indexedcols, outputCol="features")

# Create a DecisionTreeClassifier, setting the label column to the
# indexed label column ("target_numeric") and the features column to the
# newly created column from the VectorAssembler above ("features").
# Store the StringIndexer, the VectorAssembler, and the
# DecisionTreeClassifier in a list called "steps".
clf = DecisionTreeClassifier(labelCol="target_numeric", impurity="gini", maxBins=32, maxMemoryInMB=1024)

# Create the transform steps for the ML pipeline
steps = [indexer1,
         va, clf]

# Create a ML pipeline named "pl" using the steps list to set the stages parameter
pl = Pipeline(stages=steps)

# Run the fit method of the pipeline on the DataFrame and store the fitted
# model in a new variable called "plmodel"
plmodel = pl.fit(train_df)

######################################################################################
# Scoring Set
######################################################################################

# Now get the data you want to run the model against
scoring_df = sqlContext.read.parquet("/data/scoring_set")

# Convert the feature columns to double, assigning the result back
scoring_df = scoring_df.withColumn("income", scoring_df["income"].cast(DoubleType()))
scoring_df = scoring_df.withColumn("age", scoring_df["age"].cast(DoubleType()))

# Run the transform method of the pipeline model created above on the
# "scoring_df" DataFrame to create a new DataFrame called "predictions".
#
# handleInvalid="skip" on the target indexer drops rows whose labels were
# not seen during training; otherwise errors such as "unseen label: 40"
# are raised for values the training set never contained.
predictions = plmodel.transform(scoring_df)

# UDFs that pull the individual class probabilities out of the
# "probability" vector column; declare the return type as double
vector_udf1 = udf(lambda vector: float(vector[1]), DoubleType())
vector_udf0 = udf(lambda vector: float(vector[0]), DoubleType())

# Save the scored DataFrame to HDFS
outputDF = predictions.select("age",
                              "income",
                              "prediction",
                              vector_udf0("probability").alias("probability0"),
                              vector_udf1("probability").alias("probability1"))
outputDF.write.format("parquet").mode("overwrite").save("/data/algo_scored")

Regarding machine-learning - ML DecisionTreeClassifier - continuous features, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45379559/
