
apache-spark - Spark randomSplit - inconsistent results on each run


I am trying to split my dataset into training and non-training sets using:

inDataSet.randomSplit(weights.toArray, 0)

Every run I get a different result. Is this expected? If so, how can I get the same percentage of rows every time?

For example: the random-split weights for Training Offer are ArrayBuffer(0.3, 0.7). I have 72 rows in total, so for the 0.3 weight I expect roughly 21 rows. Instead I sometimes get 23, 29, 19, or 4. Please advise.

Note: my weights already sum to 1.0 (0.3 + 0.7), so weight normalization is not a factor here.

-- The other question is helpful, but that covers a single execution. I ran the test N times and got a different result set each time.
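
For background: randomSplit samples each row independently against the normalized weights, so the slice sizes are themselves random; with 72 rows and a 0.3 weight the count only averages about 21.6, with a standard deviation near 3.9, which covers results like 19 to 29 (a count like 4 suggests the input itself changed between evaluations). A fixed seed also only yields a repeatable split when the input's contents and partitioning are stable across evaluations, which is why caching before splitting is the usual remedy. A minimal sketch of that remedy, with an illustrative seed and variable names:

    // Sketch: repeatable randomSplit, assuming inDataSet from the question.
    // Caching pins the rows and the partition layout; the fixed seed pins
    // the per-row draw. Without both, each action can re-evaluate the
    // lineage and produce a different split.
    val stableDs = inDataSet.cache()
    stableDs.count() // materialize the cache before splitting
    val Array(trainDs, restDs) = stableDs.randomSplit(Array(0.3, 0.7), seed = 42L)
    // Even then, counts fluctuate around 30%/70%: randomSplit does not
    // guarantee exact slice sizes.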

Best Answer

A possible implementation I typed up (along the lines of the link in the second comment):

    import org.apache.spark.sql.{Dataset, Row}
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, rand, rank}
    import scala.collection.mutable

    // addTrainingData is a helper defined elsewhere in the author's code.
    def doTrainingOffer(inDataSet: Dataset[Row],
                        fieldName: String,
                        training_offer_list: List[(Long, Long, Int, String, String)]):
        (Dataset[Row], Option[Dataset[Row]]) = {
      println("Doing Training Offer!")

      // Assign a deterministic row_id, then shuffle the row order.
      val randomDs = inDataSet
        .withColumn("row_id", rank().over(Window.partitionBy().orderBy(fieldName)))
        .orderBy(rand())

      randomDs.cache()
      val count = randomDs.count()
      println(s"The total no of rows for this use case is: ${count}")

      val trainedDatasets = new mutable.ArrayBuffer[Dataset[Row]]()
      var startPos = 0L
      var endPos = 0L
      for (trainingOffer <- training_offer_list) {
        // trainingOffer._3 holds the percentage for this offer.
        val noOfRows = scala.math.round(count * trainingOffer._3 / 100.0)
        endPos += noOfRows
        println(s"for training offer id: ${trainingOffer._1} and percent of ${trainingOffer._3}, the start and end are ${startPos}, ${endPos}")
        // Carve out this offer's slice by row_id range.
        trainedDatasets += addTrainingData(randomDs.where(col("row_id") > startPos && col("row_id") <= endPos), trainingOffer)
        startPos = endPos
      }

      val combinedDs = trainedDatasets.reduce(_ union _)
      // (left over for other offers, trained offers)
      (randomDs.join(combinedDs, Seq(fieldName), "left_anti"), Option(combinedDs))
    }
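
A hypothetical invocation of the above, purely for illustration (the column name and offer tuples are invented here; the tuple layout follows the signature: (offer id, another id, percent, name, description)):

    val offers = List(
      (1L, 100L, 30, "offer-A", "30 percent slice"),
      (2L, 101L, 70, "offer-B", "70 percent slice")
    )
    // "customer_id" is a made-up column name; use a unique key column.
    val (leftoverDs, trainedOpt) = doTrainingOffer(inDataSet, "customer_id", offers)
    trainedOpt.foreach(ds => println(s"Trained rows: ${ds.count()}"))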

And another possible implementation:

    // Same scope as above: inDataSet, fieldName, training_offer_list and the
    // addTrainingData helper are assumed to be available.
    val randomDs = inDataSet.orderBy(rand())
    randomDs.cache()
    val count = randomDs.count()
    println(s"The total no of rows for this use case is: ${count}")
    val trainedDatasets = new mutable.ArrayBuffer[Dataset[Row]]()

    for (trainingOffer <- training_offer_list) {
      // limit() takes an Int, so round to Int; 100.0 avoids integer division.
      val sliceSize = scala.math.round(count * trainingOffer._3 / 100.0).toInt
      if (trainedDatasets.length > 1) {
        // Exclude rows already claimed by earlier offers, then take the next slice.
        val combinedDs = trainedDatasets.reduce(_ union _)
        val remainderDs = randomDs.join(combinedDs, Seq(fieldName), "left_anti")
        trainedDatasets += addTrainingData(remainderDs.limit(sliceSize), trainingOffer)
      }
      else if (trainedDatasets.length == 1) {
        val remainderDs = randomDs.join(trainedDatasets(0), Seq(fieldName), "left_anti")
        trainedDatasets += addTrainingData(remainderDs.limit(sliceSize), trainingOffer)
      }
      else {
        val tDs = randomDs.limit(sliceSize)
        trainedDatasets += addTrainingData(tDs, trainingOffer)
      }
    }

    val combinedDs = trainedDatasets.reduce(_ union _)
    // (left over for other offers, trained offers)
    (randomDs.join(combinedDs, Seq(fieldName), "left_anti"), Option(combinedDs))
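
One note on both versions (an observation, not part of the original answer): orderBy(rand()) draws a fresh random order on every evaluation, so the slices are only stable while randomDs remains cached. If reproducibility across runs matters, rand also accepts a seed:

    // Seeded shuffle: deterministic as long as the input rows and their
    // partitioning do not change between runs.
    val randomDs = inDataSet.orderBy(rand(42L))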

Regarding apache-spark - Spark randomSplit - inconsistent results on each run, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50979024/
