scala - How to get the probability corresponding to a class from a Spark ML random forest

Repost · Author: 行者123 · Updated: 2023-12-04 04:42:21

I have been using org.apache.spark.ml.Pipeline for machine learning tasks. It is especially important to know the actual probabilities rather than just the predicted label, and I am having trouble getting them. Here I am doing a binary classification task with a random forest. The class labels are "Yes" and "No". I would like to output the probability for the label "Yes". The probabilities are stored in a DenseVector in the pipeline output, e.g. [0.69, 0.31], but I don't know which element corresponds to "Yes" (0.69 or 0.31?). I guess there should be a way to retrieve this from the labelIndexer?

Here is the code I use to train the model:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

val sc = new SparkContext(new SparkConf().setAppName("ML").setMaster("local"))
val sqlContext = new SQLContext(sc)
val data = .... // load data from file
val df = sqlContext.createDataFrame(data).toDF("label", "features")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(df)

val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(2)
.fit(df)


// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)

val Array(trainingData, testData) = df.randomSplit(Array(0.7, 0.3))


// Train a RandomForest model.
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setNumTrees(10)
.setFeatureSubsetStrategy("auto")
.setImpurity("gini")
.setMaxDepth(4)
.setMaxBins(32)

// Create pipeline
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Train model
val model = pipeline.fit(trainingData)

// Save model
sc.parallelize(Seq(model), 1).saveAsObjectFile("/my/path/pipeline")

Then I load the pipeline and make predictions on new data. Here is the code snippet:
// Ignoring loading data part

// Create DF
val testdf = sqlContext.createDataFrame(testData).toDF("features", "line")
// Load pipeline
val model = sc.objectFile[org.apache.spark.ml.PipelineModel]("/my/path/pipeline").first

// My question comes here: how to extract the probability corresponding to class label "1"?
// This is my attempt. I would like to output the probability for label "Yes" together with
// the predicted label. The probabilities are stored in a DenseVector, but I don't know
// which element corresponds to "Yes". Something like this:
import org.apache.spark.mllib.linalg.DenseVector
val predictions = model.transform(testdf).select("probability").map(e => e.getAs[DenseVector](0))

Reference on RF probabilities and labels:
http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forests

Best Answer

Do you mean you want to extract the probability of the positive label from the DenseVector? If so, you can create a udf function to do it.
In the DenseVector produced for binary classification, the first element is the probability of label "0" and the second is the probability of label "1".

import org.apache.spark.sql.functions.udf
// Extract the second element of the vector, i.e. the probability of label "1"
val getOne = udf((v: org.apache.spark.mllib.linalg.Vector) => v(1))
val prediction = pipelineModel.transform(result)
val pre = prediction.select(getOne($"probability").alias("probability"))
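As a sanity check on the ordering, you can recover it from the fitted StringIndexerModel: labelIndexer.labels lists the original label strings in index order (StringIndexer assigns index 0 to the most frequent label), and a label's index is its position in the probability vector. A minimal sketch, using a hypothetical labels array in place of a real fitted model:

```scala
// Hypothetical output of labelIndexer.labels: StringIndexer orders labels by
// descending frequency, so "No" is assumed here to be the majority class.
val labels = Array("No", "Yes")

// The position of "Yes" in this array is its index in the probability vector.
val yesIndex = labels.indexOf("Yes")

// A sample probability vector as emitted by the classifier: [P(index 0), P(index 1)]
val probability = Array(0.31, 0.69)

// Probability of the label "Yes"
val pYes = probability(yesIndex)

println(s"index of Yes = $yesIndex, P(Yes) = $pYes")
```

With a real model you would use `labelIndexer.labels.indexOf("Yes")` as the index inside the udf instead of hard-coding `1`.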

Regarding "scala - How to get the probability corresponding to a class from a Spark ML random forest", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35640869/
