
tree - Predict the probability of a class from gradient boosted trees in Spark using the tree output


As we all know, GBTs in Spark give you only the predicted label as of now.

I was thinking of trying to compute the predicted probability for a class (say, of all the instances falling under a given leaf).

The code to build the GBT:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

//Importing the data
val data = sc.textFile("data/mllib/credit_approval_2_attr.csv") //using the credit approval data set from UCI machine learning repository

//Parsing the data
val parsedData = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}

//Splitting the data
val splits = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Train a GradientBoostedTrees model.
// The defaultParams for Classification use LogLoss by default.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 2 // We can use more iterations in practice.
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 2
boostingStrategy.treeStrategy.maxBins = 32
boostingStrategy.treeStrategy.subsamplingRate = 0.5
boostingStrategy.treeStrategy.maxMemoryInMB = 1024
boostingStrategy.learningRate = 0.1

// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(training, boostingStrategy)

model.toDebugString

For simplicity, this gives me 2 trees of depth 2, as shown below:
 Tree 0:
   If (feature 3 <= 2.0)
     If (feature 2 <= 1.25)
       Predict: -0.5752212389380531
     Else (feature 2 > 1.25)
       Predict: 0.07462686567164178
   Else (feature 3 > 2.0)
     If (feature 0 <= 30.17)
       Predict: 0.7272727272727273
     Else (feature 0 > 30.17)
       Predict: 1.0
 Tree 1:
   If (feature 5 <= 67.0)
     If (feature 4 <= 100.0)
       Predict: 0.5739387416147804
     Else (feature 4 > 100.0)
       Predict: -0.550117566730937
   Else (feature 5 > 67.0)
     If (feature 2 <= 0.0)
       Predict: 3.0383669122382835
     Else (feature 2 > 0.0)
       Predict: 0.4332824083446489

My question is: can I use the trees above to compute the predicted probability? For example,

for every instance in the feature set used for prediction, compute

exp(leaf score of tree 0 + leaf score of tree 1) / (1 + exp(leaf score of tree 0 + leaf score of tree 1))

This gives me a kind of probability, but I am not sure whether it is the right way to do it. Also, is there any documentation that explains how the leaf scores (predictions) are calculated? I would be grateful if anyone could share it.
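
As a worked illustration of the formula above (an editorial sketch, not the original poster's code): suppose an instance falls into Tree 0's leaf scoring 0.7272727272727273 and Tree 1's leaf scoring 0.5739387416147804, and ignore the per-tree weights (treeWeights), which the accepted answer below does apply.

// Hypothetical leaf scores, taken from the printed trees above
val leafScoreTree0 = 0.7272727272727273
val leafScoreTree1 = 0.5739387416147804

// Sigmoid of the summed leaf scores, as proposed in the question
val rawScore = leafScoreTree0 + leafScoreTree1
val prob = math.exp(rawScore) / (1 + math.exp(rawScore)) // approx. 0.786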

Any suggestions would be great.

Best Answer

Here is my approach using Spark's internal dependencies. You will need to import the linear algebra library later for the matrix operation, i.e. multiplying the tree predictions by the learning rate.

import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}

Assume you build your model with GBT:

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

To compute the probability using the model object:

// Get the log odds predictions from each tree
val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }

// Transform the arrays into matrices for multiplication
val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
val learningRate = model.treeWeights
val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)

// Calculate probability by ensembling the log odds
val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
classProb.collect

// You may tweak your decision boundary for different class labels
val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
classLabel.collect
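
A per-point alternative to the matrix version above (a sketch under the assumption that the model was trained with the default LogLoss, so a sigmoid over the weighted sum of the tree predictions is the intended transform; model and testData are the same objects as above):

// Ensemble the tree predictions per point with zip instead of RowMatrix
val classProbPerPoint = testData.map { point =>
  val margin = model.trees.zip(model.treeWeights)
    .map { case (tree, weight) => weight * tree.predict(point.features) }
    .sum
  1.0 / (1.0 + math.exp(-margin))
}
classProbPerPoint.collect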

Here is a code snippet that you can copy and paste directly into spark-shell:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Load and parse the data file.
val csvData = sc.textFile("data/mllib/sample_tree_data.csv")
val data = csvData.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GBT model.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 50
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 6
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Get class label from raw predict function
val predictedLabels = model.predict(testData.map(_.features))
predictedLabels.collect

// Get class probability
val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }
val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
val learningRate = model.treeWeights
val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)
val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
classLabel.collect

Regarding "tree - Predict the probability of a class from gradient boosted trees in Spark using the tree output", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37303855/
