scala - How to convert a Spark DataFrame to an RDD of mllib LabeledPoints?


I am trying to apply PCA to my data and then run RandomForest on the transformed data. However, PCA.transform(data) gives me a DataFrame, while I need mllib LabeledPoints to feed my RandomForest. How can I do that?
My code:

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors


val dataset = MLUtils.loadLibSVMFile(sc, "data/mnist/mnist.bz2")

val splits = dataset.randomSplit(Array(0.7, 0.3))

val (trainingData, testData) = (splits(0), splits(1))

val trainingDf = trainingData.toDF()

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(100)
  .fit(trainingDf)

val pcaTrainingData = pca.transform(trainingDf)

val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 20
val maxBins = 32

val model = RandomForest.trainClassifier(pcaTrainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)


error: type mismatch;
found : org.apache.spark.sql.DataFrame
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]

I tried the following two possible solutions, but neither worked:
 scala> val pcaTrainingData = trainingData.map(p => p.copy(features = pca.transform(p.features)))
<console>:39: error: overloaded method value transform with alternatives:
(dataset: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame <and>
(dataset: org.apache.spark.sql.DataFrame,paramMap: org.apache.spark.ml.param.ParamMap)org.apache.spark.sql.DataFrame <and>
(dataset: org.apache.spark.sql.DataFrame,firstParamPair: org.apache.spark.ml.param.ParamPair[_],otherParamPairs: org.apache.spark.ml.param.ParamPair[_]*)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.mllib.linalg.Vector)

And:
val labeled = pca
  .transform(trainingDf)
  .map(row => LabeledPoint(row.getDouble(0), row(4).asInstanceOf[Vector[Int]]))

error: type mismatch;
found : scala.collection.immutable.Vector[Int]
required: org.apache.spark.mllib.linalg.Vector

(I did import org.apache.spark.mllib.linalg.Vectors in the example above.)

Any help?

Best Answer

The correct approach here is the second one you tried: mapping each Row to a LabeledPoint to obtain an RDD[LabeledPoint]. However, it contains two mistakes:

  • The correct Vector class (org.apache.spark.mllib.linalg.Vector) does not take type parameters (e.g. Vector[Int]), so even though you have the right import, the compiler concludes that you mean scala.collection.immutable.Vector, which does.
  • The DataFrame returned by pca.transform() has only 3 columns, yet you try to extract column number 4. For example, here are its first 4 rows:
    +-----+--------------------+--------------------+
    |label| features| pcaFeatures|
    +-----+--------------------+--------------------+
    | 5.0|(780,[152,153,154...|[880.071111851977...|
    | 1.0|(780,[158,159,160...|[-41.473039034112...|
    | 2.0|(780,[155,156,157...|[931.444898405036...|
    | 1.0|(780,[124,125,126...|[25.5114585648411...|
    +-----+--------------------+--------------------+

    To make this easier, I prefer using the column names instead of their indices.

So here is the transformation you need:

    val labeled = pca.transform(trainingDf).rdd.map(row => LabeledPoint(
      row.getAs[Double]("label"),
      row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")
    ))
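
With that, labeled is an RDD[LabeledPoint] and drops straight into the training call from your own code, reusing the same hyperparameters:

    val model = RandomForest.trainClassifier(labeled, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)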

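As an aside, your first attempt failed because the ml PCAModel's transform() only accepts a DataFrame, never a single vector. If you would rather stay with RDDs throughout, here is a minimal sketch using the RDD-based org.apache.spark.mllib.feature.PCA instead (note this swaps in a different PCA implementation than the ml one used above):

    import org.apache.spark.mllib.feature.{PCA => MllibPCA}

    // Fit the RDD-based PCA on the raw feature vectors of the training set.
    val mllibPca = new MllibPCA(100).fit(trainingData.map(_.features))

    // Its model transforms individual mllib Vectors, so the per-point mapping
    // from your first attempt works as intended.
    val pcaTrainingData = trainingData.map(p => p.copy(features = mllibPca.transform(p.features)))
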
Regarding "scala - How to convert a Spark DataFrame to an RDD of mllib LabeledPoints?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35966921/
