
Scala Spark: Split collection into several RDD?

Reposted. Author: 行者123. Updated: 2023-12-02 13:48:13

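Is there any Spark function that splits a collection into several RDDs according to some criteria? Such a function would avoid iterating over the data more than once. For example: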

    import org.apache.spark.{SparkConf, SparkContext}

    def main(args: Array[String]): Unit = {
      val logFile = "file.txt"
      val conf = new SparkConf().setAppName("Simple Application")
      val sc = new SparkContext(conf)
      val logData = sc.textFile(logFile, 2).cache()
      // saveAsTextFile returns Unit, so there is nothing useful to bind to a val.
      logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
      logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")
    }

In this example I have to iterate over `logData` twice just to write the results to two separate files:

    logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
    logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")

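It would be nice to have something like this instead:

    val resultMap = logData.map(line => if (line.contains("a")) ("a", line) else if (line.contains("b")) ("b", line) else (" - ", line))
    resultMap.writeByKey("a", "linesA.txt") // writeByKey is the hypothetical method I am looking for
    resultMap.writeByKey("b", "linesB.txt")

Does anything like this exist?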

Best answer

Maybe something like this would work:

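    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel
    import scala.collection.mutable.ArrayBuffer
    import scala.reflect.ClassTag

    // T needs a ClassTag because RDD.flatMap requires one for its element type.
    def singlePassMultiFilter[T: ClassTag](
        rdd: RDD[T],
        f1: T => Boolean,
        f2: T => Boolean,
        level: StorageLevel = StorageLevel.MEMORY_ONLY
    ): (RDD[T], RDD[T], Boolean => Unit) = {
      // Walk each partition once, collecting the matches of both filters.
      val tempRDD = rdd mapPartitions { iter =>
        val abuf1 = ArrayBuffer.empty[T]
        val abuf2 = ArrayBuffer.empty[T]
        for (x <- iter) {
          if (f1(x)) abuf1 += x
          if (f2(x)) abuf2 += x
        }
        Iterator.single((abuf1, abuf2))
      }
      tempRDD.persist(level)
      val rdd1 = tempRDD.flatMap(_._1)
      val rdd2 = tempRDD.flatMap(_._2)
      // The third element lets the caller release the persisted intermediate RDD.
      (rdd1, rdd2, (blocking: Boolean) => tempRDD.unpersist(blocking))
    }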

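Note that an action called on `rdd1` (or `rdd2`) causes `tempRDD` to be computed and persisted. That work is practically equivalent to computing `rdd2` (respectively `rdd1`) as well, since the overhead of the `flatMap` in the definitions of `rdd1` and `rdd2` is, I believe, going to be pretty negligible.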

You would use `singlePassMultiFilter` like so:

    val (rdd1, rdd2, cleanUp) = singlePassMultiFilter(rdd, f1, f2)
    rdd1.persist()    // I'm going to need `rdd1` more later...
    println(rdd1.count)
    println(rdd2.count)
    cleanUp(true)     // I'm done with `rdd2` and `rdd1` has been persisted so free stuff up...
    println(rdd1.distinct.count)

Clearly this could be extended to an arbitrary number of filters, collections of filters, etc.
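As a rough illustration of that extension, here is a minimal sketch that takes a collection of predicates instead of exactly two. The name `singlePassFilters` and its signature are made up for this example and are not part of Spark; the approach is the same as above, with one buffer per predicate.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

// Hypothetical helper (not a Spark API): one output RDD per predicate,
// built from a single pass over each partition of the input.
def singlePassFilters[T: ClassTag](
    rdd: RDD[T],
    predicates: Seq[T => Boolean],
    level: StorageLevel = StorageLevel.MEMORY_ONLY
): (Seq[RDD[T]], Boolean => Unit) = {
  val tempRDD = rdd.mapPartitions { iter =>
    // One buffer per predicate, in the same order as `predicates`.
    val buffers = Vector.fill(predicates.size)(ArrayBuffer.empty[T])
    for (x <- iter; (p, buf) <- predicates.zip(buffers)) {
      if (p(x)) buf += x
    }
    Iterator.single(buffers)
  }
  tempRDD.persist(level)
  // The i-th output RDD flattens the i-th buffer of every partition.
  val outputs = predicates.indices.map(i => tempRDD.flatMap(bufs => bufs(i)))
  (outputs, (blocking: Boolean) => { tempRDD.unpersist(blocking); () })
}
```

For the log example in the question, this could be called as `val (Seq(lineAs, lineBs), cleanUp) = singlePassFilters(logData, Seq[String => Boolean](_.contains("a"), _.contains("b")))`, followed by one `saveAsTextFile` per output and then `cleanUp(true)`. Keep in mind that, like the two-filter version, this buffers every match of a partition in memory before emitting it, so it suits partitions whose matches fit comfortably in memory.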

Regarding "Scala Spark: Split collection into several RDD?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27231524/
