apache-spark - Accessing the estimator in a Spark pipeline

Reposted · Author: 行者123 · Updated: 2023-12-04 05:15:01

Similar to Is it possible to access estimator attributes in spark.ml pipelines?, I want to access the estimator, e.g. the last element in a pipeline.

The approach mentioned there no longer seems to work in Spark 2.0.1. How does it work now?

Edit

Maybe I should explain it in a bit more detail: here is my estimator plus vector assembler:

val numRound = 20
val numWorkers = 4
val xgbBaseParams = Map(
  "max_depth" -> 10,
  "eta" -> 0.1,
  "seed" -> 50,
  "silent" -> 1,
  "objective" -> "binary:logistic"
)

val xgbEstimator = new XGBoostEstimator(xgbBaseParams)
  .setFeaturesCol("features")
  .setLabelCol("label")

val vectorAssembler = new VectorAssembler()
  .setInputCols(train.columns.filter(!_.contains("label")))
  .setOutputCol("features")

val simplePipeParams = new ParamGridBuilder()
  .addGrid(xgbEstimator.round, Array(numRound))
  .addGrid(xgbEstimator.nWorkers, Array(numWorkers))
  .build()

val simplPipe = new Pipeline()
  .setStages(Array(vectorAssembler, xgbEstimator))

val numberOfFolds = 2
val cv = new CrossValidator()
  .setEstimator(simplPipe)
  .setEvaluator(new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setRawPredictionCol("prediction"))
  .setEstimatorParamMaps(simplePipeParams)
  .setNumFolds(numberOfFolds)
  .setSeed(gSeed)

val cvModel = cv.fit(train)
val trainPerformance = cvModel.transform(train)
val testPerformance = cvModel.transform(test)
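As background for the tuning setup above: ParamGridBuilder.build() produces the cross product of all value arrays added via addGrid (here a single one-element combination per parameter). A plain-Scala sketch of that cross product, with made-up parameter names purely for illustration:

```scala
// Illustrative only: ParamGridBuilder.build() conceptually yields the
// cross product of all candidate values as a sequence of param maps.
val rounds = Seq(10, 20)   // hypothetical candidate values
val workers = Seq(2, 4)
val grid: Seq[Map[String, Int]] =
  for (r <- rounds; w <- workers)
    yield Map("round" -> r, "nWorkers" -> w)
// grid has rounds.size * workers.size = 4 entries
```

CrossValidator then fits the pipeline once per combination and per fold, so the grid size directly multiplies training cost.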

Now I would like to perform custom scoring, e.g. with a cutoff point other than 0.5. That is possible if I can get hold of the model:

val realModel = cvModel.bestModel.asInstanceOf[XGBoostClassificationModel]

But this step does not compile here. Thanks to the suggestions, I can obtain the model:

val pipelineModel: Option[PipelineModel] = cvModel.bestModel match {
  case p: PipelineModel => Some(p)
  case _ => None
}

val realModel: Option[XGBoostClassificationModel] = pipelineModel
  .flatMap {
    _.stages.collect { case t: XGBoostClassificationModel => t }
      .headOption
  }

// TODO write it nicer
val measureResults = realModel.map { rm =>
  for (
    thresholds <- Array(Array(0.2, 0.8), Array(0.3, 0.7), Array(0.4, 0.6),
      Array(0.6, 0.4), Array(0.7, 0.3), Array(0.8, 0.2))
  ) {
    rm.setThresholds(thresholds)

    val predResult = rm.transform(test)
      .select("label", "probabilities", "prediction")
      .as[LabelledEvaluation]
    println("cutoff was ", thresholds)
    calculateEvaluation(R, predResult)
  }
}
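For intuition about the threshold grid above: as far as I understand Spark's ProbabilisticClassificationModel, setThresholds makes the predicted class the one that maximises probability(i) / threshold(i). A minimal pure-Scala sketch of that rule, with a hypothetical helper name and ignoring the special case of zero thresholds:

```scala
// Simplified sketch of how thresholds turn a probability vector into a
// prediction: the predicted class maximises probability(i) / threshold(i).
// (Hypothetical helper; real Spark also special-cases zero thresholds.)
def predictWithThresholds(probabilities: Array[Double],
                          thresholds: Array[Double]): Int =
  probabilities.zip(thresholds)
    .map { case (p, t) => p / t }
    .zipWithIndex
    .maxBy(_._1)
    ._2
```

With the default Array(0.5, 0.5) this reduces to a plain argmax; a grid entry like Array(0.2, 0.8) lowers the bar for class 0 and so biases the decision toward it.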

However, the problem is that

val predResult = rm.transform(test)

will fail, because the raw data does not contain the feature column produced by the vectorAssembler. That column is only created when the full pipeline is run.

So I decided to create a second pipeline:

val scoringPipe = new Pipeline()
  .setStages(Array(vectorAssembler, rm))
val predResult = scoringPipe.fit(train).transform(test)

But that seems a bit clumsy. Do you have a better / nicer idea?

Best Answer

Nothing has changed in Spark 2.0.0; the same approach still works. Example Pipeline:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

And the model:

val logRegModel = model.stages.last
  .asInstanceOf[org.apache.spark.ml.classification.LogisticRegressionModel]
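Both extraction styles seen here, `stages.last` with `asInstanceOf` and the safer `collect` over `stages` from the question's edit, are plain Scala collection patterns. A self-contained illustration with stand-in stage types (hypothetical, just for demonstration):

```scala
// Stand-in types playing the role of fitted pipeline stages:
sealed trait Stage
case class AssemblerStage(outputCol: String) extends Stage
case class ModelStage(name: String) extends Stage

val stages: Seq[Stage] = Seq(AssemblerStage("features"), ModelStage("xgb"))

// Unsafe: throws ClassCastException if the last stage is not a ModelStage.
val lastModel = stages.last.asInstanceOf[ModelStage]

// Safe: yields None instead of throwing when no ModelStage is present.
val maybeModel: Option[ModelStage] =
  stages.collect { case m: ModelStage => m }.headOption
```

The `collect` variant is preferable when the pipeline layout may change, since a missing or reordered stage degrades to `None` rather than a runtime exception.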

Regarding apache-spark - accessing the estimator in a Spark pipeline, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40557604/
