
scala - Spark random forest binary classifier metrics

Reposted. Author: 行者123. Updated: 2023-12-04 18:03:33

When training a random forest binary classifier model in Spark MLlib, how can we obtain model metrics (F-score, AUROC, AUPRC, etc.)?

The problem is that BinaryClassificationMetrics takes probabilities, while the predict method of the RandomForest classifier returns the discrete values 0 or 1.

See: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification

A model trained with RandomForest.trainClassifier has no clearThreshold method that would make it return probabilities instead of discrete 0 or 1 labels.

Best answer

To get probabilities, we need to use the new DataFrame-based ml API rather than the RDD-based mllib API.
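To illustrate why scores matter here, the following is a minimal sketch in plain Scala with no Spark dependency. It computes AUROC with the rank-based (Mann-Whitney) formulation, once from continuous scores and once from the same scores thresholded to 0/1. The object name `AurocSketch`, the helper `auroc`, and the data are all hypothetical and are not part of the Spark API.

```scala
// Hypothetical illustration, not Spark code.
object AurocSketch {
  // Rank-based AUROC: the fraction of (positive, negative) pairs in which
  // the positive example receives the higher score; a tie counts as 0.5.
  def auroc(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoreAndLabels.collect { case (s, l) if l == 1.0 => s }
    val neg = scoreAndLabels.collect { case (s, l) if l == 0.0 => s }
    require(pos.nonEmpty && neg.nonEmpty, "both classes must be present")
    val credits = for (p <- pos; n <- neg) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    credits.sum / (pos.size.toLong * neg.size)
  }

  // Made-up (score, label) pairs, e.g. a class-1 probability and the true label.
  val scored: Seq[(Double, Double)] =
    Seq((0.9, 1.0), (0.6, 1.0), (0.4, 1.0), (0.7, 0.0), (0.3, 0.0), (0.1, 0.0))

  // The same examples after thresholding each score at 0.5, which is the kind
  // of discrete output a predict method that returns 0 or 1 gives.
  val hard: Seq[(Double, Double)] =
    scored.map { case (s, l) => (if (s >= 0.5) 1.0 else 0.0, l) }

  def main(args: Array[String]): Unit = {
    println(s"AUROC from scores      = ${auroc(scored)}")
    println(s"AUROC from hard labels = ${auroc(hard)}")
  }
}
```

With hard labels every example collapses onto one of two score values, so most of the ranking information is lost to ties and the curve reduces to a single operating point. That is why the metric should be computed from scores.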

Update

Below is an updated version of the example from the Spark documentation. It uses BinaryClassificationEvaluator and prints two metrics: Area Under Receiver Operating Characteristic (AUROC) and Area Under Precision Recall Curve (AUPRC).

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load and parse the data file, converting it to a DataFrame.
val data = sqlContext.read.format("libsvm").load("D:/Sources/spark/data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and forest in a Pipeline
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions
  .select("indexedLabel", "rawPrediction", "prediction")
  .show()

// Evaluate on the raw (continuous) prediction column, not the 0/1 prediction.
val binaryClassificationEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setRawPredictionCol("rawPrediction")

// Print one metric by name ("areaUnderROC" or "areaUnderPR").
def printlnMetric(metricName: String): Unit = {
  println(metricName + " = " + binaryClassificationEvaluator.setMetricName(metricName).evaluate(predictions))
}

printlnMetric("areaUnderROC")
printlnMetric("areaUnderPR")
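The question also asks about the F-score, which BinaryClassificationEvaluator does not report. One possible approach, sketched below, is to feed the `probability` column of the `predictions` DataFrame into the RDD-based BinaryClassificationMetrics. This is a sketch under assumptions, not part of the original answer: it assumes Spark 2.x or later, where the `probability` column holds an `org.apache.spark.ml.linalg.Vector`, and it assumes index 1 of that vector is the score for `indexedLabel` 1.0. The helper name `thresholdMetrics` is hypothetical.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper: builds (score, label) pairs from the output of the
// ml pipeline and wraps them in the RDD-based metrics class.
// Assumes the columns "probability" (ml Vector) and "indexedLabel" (Double).
def thresholdMetrics(predictions: DataFrame): BinaryClassificationMetrics = {
  val scoreAndLabels = predictions
    .select("probability", "indexedLabel")
    .rdd
    .map { case Row(prob: Vector, label: Double) =>
      // prob(1) is taken as the score for the class indexed as 1.0
      (prob(1), label)
    }
  new BinaryClassificationMetrics(scoreAndLabels)
}

// Prints the F-measure (beta = 1 by default) at each score threshold.
def printFMeasures(predictions: DataFrame): Unit = {
  thresholdMetrics(predictions)
    .fMeasureByThreshold()
    .collect()
    .foreach { case (threshold, f) => println(s"threshold = $threshold, F1 = $f") }
}
```

Calling `printFMeasures(predictions)` after the pipeline has run would list one F-measure per threshold. Note that StringIndexer assigns indices by label frequency by default, so it is worth checking which original label ends up as 1.0 before interpreting the scores.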

For more on scala - Spark random forest binary classifier metrics, see the similar question on Stack Overflow: https://stackoverflow.com/questions/37566321/
