python - org.apache.spark.SparkException: Unseen label with TrainValidationSplit

Reposted by: 行者123. Last updated: 2023-11-30 09:35:37

I searched for this error but found nothing related to TrainValidationSplit. I want to do parameter tuning, and doing it with TrainValidationSplit raises the following error: org.apache.spark.SparkException: Unseen label.

I understand why this happens; increasing trainRatio mitigates the problem but does not fully solve it. For reference, here is (part of) the code:

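from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# categoricalCols, numericCols and train_dataset are defined elsewhere.
stages = []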
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
    stages += [stringIndexer]

assemblerInputs = [x+"Index" for x in categoricalCols] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel')
stages += [labelIndexer]

dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
stages += [dt]

evaluator = MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1')

paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1,2,6])
             .addGrid(dt.maxBins, [20,40])
             .build())

pipeline = Pipeline(stages=stages)

trainValidationSplit = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, trainRatio=0.95)

model = trainValidationSplit.fit(train_dataset)
train_dataset = model.transform(train_dataset)

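I have seen this answer, but I am not sure whether it also applies to my case, and I would like to know if there is a more suitable solution. Any help?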

Best answer

The Unseen label exception is usually associated with the StringIndexer.

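You split the data into a training set (95%) and a validation set (5%). I think some category values (in the categoricalCol columns) appear in the validation set but not in the training set.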

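So, during the string indexing stage of validation, the StringIndexer fitted on the training split encounters a label it has never seen and throws this exception. Increasing the training ratio raises the chance that the category values in the training set are a superset of those in the validation set, but this is only a workaround, since nothing guarantees it.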
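To put a rough number on "nothing guarantees it", here is a minimal sketch. It assumes each row lands in the validation split independently with probability 1 - trainRatio, which is a simplification; the helper name `prob_missing_from_train` is hypothetical and not part of Spark.

```python
def prob_missing_from_train(train_ratio: float, occurrences: int) -> float:
    """Chance that every row holding a given category value lands in the validation split.

    Assumes rows are assigned to the splits independently (a simplification).
    """
    return (1.0 - train_ratio) ** occurrences


# A value that occurs in only a few rows is the one most likely to be missed.
for k in (1, 2, 3):
    print(k, prob_missing_from_train(0.95, k))
```

Under that assumption, a category value that occurs in a single row is absent from the training split about 5% of the time at trainRatio=0.95. The risk shrinks as trainRatio grows but only reaches zero at trainRatio=1.0, which leaves nothing to validate on.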

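One possible solution: fit each StringIndexer on train_dataset first, then add the resulting StringIndexerModel to the pipeline stages. That way the StringIndexer sees all possible category values. The same reasoning applies to the labelIndexer on the label column.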

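# Fit each indexer on the full train_dataset up front, so the resulting
# StringIndexerModel already knows every category value.
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
    strIndexModel = stringIndexer.fit(train_dataset)
    stages += [strIndexModel]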
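The difference between fitting inside the split and fitting up front can be illustrated without a Spark cluster. The sketch below is a plain-Python stand-in, not Spark code: `ToyIndexer`, the sample rows, and the hand-made split are all hypothetical and only mimic the idea of a frequency-ordered string indexer that rejects values it was not fitted on.

```python
class ToyIndexer:
    """Plain-Python stand-in for a string indexer; not Spark code."""

    def fit(self, values):
        # Assign indices by descending frequency, ties broken alphabetically.
        counts = {}
        for v in values:
            counts[v] = counts.get(v, 0) + 1
        ordered = sorted(counts, key=lambda v: (-counts[v], v))
        self.index = {v: float(i) for i, v in enumerate(ordered)}
        return self

    def transform(self, values):
        out = []
        for v in values:
            if v not in self.index:
                raise ValueError(f"Unseen label: {v}")
            out.append(self.index[v])
        return out


rows = ["a", "a", "b", "a", "b", "c"]   # "c" is a rare category value
train, validation = rows[:5], rows[5:]  # hypothetical split: "c" is in validation only

# Fitting inside the split: the indexer never sees "c", so validation fails.
try:
    ToyIndexer().fit(train).transform(validation)
    fit_on_split_failed = False
except ValueError as err:
    fit_on_split_failed = True
    print(err)

# Fitting on the full dataset first: every value already has an index.
prefit = ToyIndexer().fit(rows)
print(prefit.transform(validation))
```

The first attempt corresponds to an unfitted indexer inside the pipeline, which is refit on the training split only. The second corresponds to placing an already fitted model in the stages, as in the loop above.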

Regarding python - org.apache.spark.SparkException: Unseen label with TrainValidationSplit, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43662786/
