How can I tell ml.DecisionTreeClassifier to score continuous features, rather than treating them as categorical, without resorting to the Bucketizer or QuantileDiscretizer approach?
Below is the code where I pass continuous features into the DecisionTreeClassifier in ML. Without binning the features (Bucketizer), most of the scoring set gets skipped rather than scored (Spark 2.1 does not support handleInvalid="keep").
from pyspark.mllib.linalg import Vectors
from pyspark.ml import Pipeline
from pyspark.sql import Row, SparkSession, SQLContext
from pyspark.sql.types import StringType, DoubleType
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import udf
# sqlContext is assumed to already exist (it is predefined in the pyspark shell)
# Load the training set that is in parquet format into a data frame
train_df = sqlContext.read.parquet("/data/training_set")
# convert data types to double
train_df.withColumn("income", train_df["income"].cast(DoubleType())
train_df.withColumn("age", train_df["age"].cast(DoubleType())
# StringIndexer - Target
# First we use a StringIndexer to turn the target label into a numeric index
indexer1 = StringIndexer(inputCol="target", outputCol="target_numeric", handleInvalid="skip")
############
# StringIndexer/OneHotEncoder - income
# First we use a StringIndexer to get a numeric categorical feature
indexer2 = StringIndexer(inputCol="income", outputCol="income_numeric", handleInvalid='skip')
# Next we binarize the indexed categorical feature via OneHotEncoder
encoder2 = OneHotEncoder(inputCol="income_numeric", outputCol="income_vector")
############
############
# StringIndexer/OneHotEncoder - age
# First we use a StringIndexer to get a numeric categorical feature
indexer3 = StringIndexer(inputCol="age", outputCol="age_numeric", handleInvalid='skip')
# Next we binarize the indexed categorical feature via OneHotEncoder
encoder3 = OneHotEncoder(inputCol="age_numeric", outputCol="age_vector")
############
# Collect the encoded feature columns that will be combined
# into the feature vector
indexedcols = [
"income_vector",
"age_vector"
]
# FEATURES need to be in a single Vector, which is why we convert them with a VectorAssembler.
# The VectorAssembler takes our encoded columns as input and produces the features column.
# Create a VectorAssembler transformer to combine all of the indexed
# categorical features into a vector. Provide the "indexedcols" list
# created above as the inputCols parameter, and name the outputCol "features".
va = VectorAssembler(inputCols = indexedcols, outputCol = 'features')
# Create a DecisionTreeClassifier, setting the label column to the
# indexed label column ("target_numeric") and the features column to the
# newly created column from the VectorAssembler above ("features").
# Store the StringIndexer transformers, the encoders, the VectorAssembler,
# and the DecisionTreeClassifier in a list called "steps"
clf = DecisionTreeClassifier(labelCol="target_numeric", impurity="gini", maxBins=32, maxMemoryInMB=1024)
# Create steps for transform for the ml pipeline
steps = [indexer1,
indexer2, encoder2,
indexer3, encoder3,
va, clf]
# Create an ML pipeline named "pl" using the steps list to set the stages parameter
pl = Pipeline(stages=steps)
# Run the fit method of the pipeline on the DataFrame and store the
# model in a new variable called "plmodel"
plmodel = pl.fit(train_df)
######################################################################################
# Scoring Set
######################################################################################
# Now get the data you want to run the model against
scoring_df = sqlContext.read.parquet("/data/scoring_set")
# convert data types to double
scoring_df.withColumn("income", scoring_df["income"].cast(DoubleType())
scoring_df.withColumn("age", scoring_df["age"].cast(DoubleType())
# Run the transform method of the pipeline model created above on the
# "scoring_df" DataFrame to create a new DataFrame called "predictions".
#
# handleInvalid="skip" drops any labels that are not in the training set. Without it,
# errors such as "unseen label: 40" are raised, meaning the scoring set contains a
# value for that feature which did not exist in the training set.
predictions = plmodel.transform(scoring_df)
# Extract the individual class probabilities from the probability vector as doubles
vector_udf1 = udf(lambda vector: float(vector[1]), DoubleType())
vector_udf0 = udf(lambda vector: float(vector[0]), DoubleType())
# Save dataframe to hdfs
outputDF = predictions.select("age",
                              "income",
                              "prediction",
                              vector_udf0("probability").alias("probability0"),
                              vector_udf1("probability").alias("probability1"))
outputDF.write.format("parquet").mode("overwrite").save("/data/algo_scored")
Best Answer
For continuous features there is no need to use Bucketizer or QuantileDiscretizer. For categorical features you can use StringIndexer and OneHotEncoder and include them in the pipeline, but for continuous features you only need to list them in the VectorAssembler; the DecisionTreeClassifier will work out the splits on the continuous features automatically.
So the code looks like:
from pyspark.mllib.linalg import Vectors
from pyspark.ml import Pipeline
from pyspark.sql import Row, SparkSession, SQLContext
from pyspark.sql.types import StringType, DoubleType
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import udf
# sqlContext is assumed to already exist (it is predefined in the pyspark shell)
# Load the training set that is in parquet format into a data frame
train_df = sqlContext.read.parquet("/data/training_set")
# convert data types to double
train_df.withColumn("income", train_df["income"].cast(DoubleType())
train_df.withColumn("age", train_df["age"].cast(DoubleType())
# StringIndexer - Target
# First we use a StringIndexer to turn the target label into a numeric index
indexer1 = StringIndexer(inputCol="target", outputCol="target_numeric", handleInvalid="skip")
# List the continuous feature columns; they are fed to the
# VectorAssembler directly, with no indexing or encoding
indexedcols = [
"income",
"age"
]
# FEATURES need to be in a single Vector, which is why we convert them with a VectorAssembler.
# The VectorAssembler takes the continuous columns as input and produces the features column.
# Create a VectorAssembler transformer to combine the continuous
# features into a vector. Provide the "indexedcols" list created
# above as the inputCols parameter, and name the outputCol "features".
va = VectorAssembler(inputCols = indexedcols, outputCol = 'features')
# Create a DecisionTreeClassifier, setting the label column to the
# indexed label column ("target_numeric") and the features column to the
# newly created column from the VectorAssembler above ("features").
# Store the StringIndexer transformer, the VectorAssembler,
# and the DecisionTreeClassifier in a list called "steps"
clf = DecisionTreeClassifier(labelCol="target_numeric", impurity="gini", maxBins=32, maxMemoryInMB=1024)
# Create steps for transform for the ml pipeline
steps = [indexer1,
va, clf]
# Create an ML pipeline named "pl" using the steps list to set the stages parameter
pl = Pipeline(stages=steps)
# Run the fit method of the pipeline on the DataFrame and store the
# model in a new variable called "plmodel"
plmodel = pl.fit(train_df)
######################################################################################
# Scoring Set
######################################################################################
# Now get the data you want to run the model against
scoring_df = sqlContext.read.parquet("/data/scoring_set")
# convert data types to double
scoring_df.withColumn("income", scoring_df["income"].cast(DoubleType())
scoring_df.withColumn("age", scoring_df["age"].cast(DoubleType())
# Run the transform method of the pipeline model created above on the
# "scoring_df" DataFrame to create a new DataFrame called "predictions".
#
# handleInvalid="skip" drops any labels that are not in the training set. Without it,
# errors such as "unseen label: 40" are raised, meaning the scoring set contains a
# value for that feature which did not exist in the training set.
predictions = plmodel.transform(scoring_df)
# Extract the individual class probabilities from the probability vector as doubles
vector_udf1 = udf(lambda vector: float(vector[1]), DoubleType())
vector_udf0 = udf(lambda vector: float(vector[0]), DoubleType())
# Save dataframe to hdfs
outputDF = predictions.select("age",
                              "income",
                              "prediction",
                              vector_udf0("probability").alias("probability0"),
                              vector_udf1("probability").alias("probability1"))
outputDF.write.format("parquet").mode("overwrite").save("/data/algo_scored")
Regarding machine-learning - ML DecisionTreeClassifier - continuous features, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45379559/