
apache-spark - Input format problem with MLlib

Reposted. Author: 行者123. Updated: 2023-12-01 05:11:37

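I want to run SVM regression, but I am having trouble with the input format. Right now, my training and test set for one customer looks like this: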

1 '12262064 |f offer_quantity:1 
has_bought_brand_company:1 has_bought_brand_a:6.79 has_bought_brand_q_60:1.0
has_bought_brand:2.0 has_bought_company_a:1.95 has_bought_brand_180:1.0
has_bought_brand_q_180:1.0 total_spend:218.37 has_bought_brand_q:3.0 offer_value:1.5
has_bought_brand_a_60:2.79 has_bought_brand_60:1.0 has_bought_brand_q_90:1.0
has_bought_brand_a_90:2.79 has_bought_company_q:1.0 has_bought_brand_90:1.0
has_bought_company:1.0 never_bought_category:1 has_bought_brand_a_180:2.79

I tried to read this text file into Spark, but without success. What am I missing? Do I have to remove the feature names? Right now it is in Vowpal Wabbit format.

My code looks like this:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "mllib/data/train.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model.
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)

// Clear the default threshold so that predict returns raw scores.
model.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()

println("Area under ROC = " + auROC)

I get a result, but my AUC is 1, which should not be the case.
scala> println("Area under ROC = " + auROC)
Area under ROC = 1.0
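
For reference, an AUC of exactly 1.0 means that every positive example in the test set received a higher score than every negative one. The following is a minimal sketch of that pairwise definition in plain Scala. It is not how MLlib's BinaryClassificationMetrics is implemented; the object name `AucSketch` and the assumption that labels are 0.0 and 1.0 are illustrative only.

```scala
object AucSketch {
  /** AUC as the fraction of (positive, negative) pairs that are ranked
    * correctly. Tied scores count as half a correct pair.
    * Assumes labels are 1.0 for positive and 0.0 for negative. */
  def auc(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoreAndLabels.collect { case (s, l) if l > 0.5 => s }
    val neg = scoreAndLabels.collect { case (s, l) if l <= 0.5 => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one example of each class")
    val wins = for (p <- pos; n <- neg) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    wins.sum / (pos.size.toLong * neg.size)
  }
}
```

Running a few `(score, label)` pairs from the test set through a check like this is one way to see whether the scores really separate the two classes or whether the parsed data is not what was intended.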

Best answer

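I don't think your file is in LIBSVM format. Either convert the file to LIBSVM format, or load it as a plain text file and build the LabeledPoint objects yourself. This is what I did for my file: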

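import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Hash the tokens of each line into a 2-dimensional feature vector.
val tf = new HashingTF(2)
// tweetInput is the path to the input text file.
val tweets = sc.textFile(tweetInput)

val labelPoint = tweets.map { l =>
  val parts = l.split(' ')
  // The first token is the label; the remaining tokens are hashed as bigrams.
  // Each bigram is joined into a String so that it is hashed by content.
  val t = tf.transform(parts.tail.sliding(2).map(_.mkString(" ")).toSeq)
  LabeledPoint(parts(0).toDouble, t)
}.cache()
labelPoint.count()

// Same iteration count as in the question.
val numIterations = 100
val model = LinearRegressionWithSGD.train(labelPoint, numIterations)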
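
The other option mentioned in this answer, converting the file to LIBSVM format, could be sketched roughly as follows in plain Scala. LIBSVM lines have the form `label index:value index:value ...`, with one-based indices in ascending order, so the Vowpal Wabbit feature names have to be replaced by integer indices. The object and method names (`VwToLibSvm`, `parseVw`, `buildIndex`, `toLibSvm`) are hypothetical. The sketch assumes each example sits on a single line with exactly one `|` section, and that a feature without an explicit value counts as 1.0.

```scala
object VwToLibSvm {
  /** Parse one Vowpal Wabbit line into (label, Seq(featureName -> value)).
    * The tag (e.g. '12262064) and the namespace name (e.g. f) are dropped. */
  def parseVw(line: String): (Double, Seq[(String, Double)]) = {
    val Array(header, body) = line.split("\\|", 2)
    val label = header.trim.split("\\s+").head.toDouble
    val tokens = body.trim.split("\\s+").filter(_.nonEmpty).toSeq
    // A namespace name, if present, follows '|' with no space in between.
    val hasNamespace = body.nonEmpty && !body.head.isWhitespace
    val featureTokens = if (hasNamespace) tokens.drop(1) else tokens
    val features = featureTokens.map { tok =>
      val i = tok.lastIndexOf(':')
      if (i < 0) (tok, 1.0)
      else (tok.substring(0, i), tok.substring(i + 1).toDouble)
    }
    (label, features)
  }

  /** Assign a one-based index to every feature name seen in the data. */
  def buildIndex(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(l => parseVw(l)._2.map(_._1)).distinct.sorted
      .zipWithIndex.map { case (name, i) => name -> (i + 1) }.toMap

  /** Render one Vowpal Wabbit line as a LIBSVM line using the given index. */
  def toLibSvm(line: String, index: Map[String, Int]): String = {
    val (label, features) = parseVw(line)
    val cols = features
      .map { case (name, v) => (index(name), v) }
      .sortBy(_._1)
      .map { case (i, v) => s"$i:$v" }
    (label.toString +: cols).mkString(" ")
  }
}
```

A file written this way should be loadable with `MLUtils.loadLibSVMFile` as in the question. Alternatively, the parsed pairs could be turned directly into `LabeledPoint(label, Vectors.sparse(size, indices, values))`, keeping in mind that `Vectors.sparse` expects zero-based indices. Also note that MLlib's binary classifiers expect labels 0 and 1, so Vowpal Wabbit style labels of -1 would need to be remapped first.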

Regarding apache-spark - input format problem with MLlib, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/24454688/
