
scala - Spark cached RDD is computed n times

Reposted. Author: 行者123. Updated: 2023-12-03 17:30:41

I am having a problem with a Spark application. Here is a simplified version of my code:

import org.apache.spark.SparkContext
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

// utils, logger and myProcessor are defined elsewhere in the application
def main(args: Array[String]): Unit = {
  // Initializing spark context
  val sc = new SparkContext()
  val nbExecutors = sc.getConf.getInt("spark.executor.instances", 3)
  System.setProperty("spark.sql.shuffle.partitions", nbExecutors.toString)

  // Getting files from TGZ archives
  // This returns an RDD of tuples containing (filename, input stream)
  val archivesRDD: RDD[(String, PortableDataStream)] = utils.getFilesFromHDFSDirectory("/my/dir/*.tar.gz")
  val filesRDD: RDD[String] = archivesRDD.flatMap(tgzStream => {
    logger.debug("Getting files from archive : " + tgzStream._1)
    utils.getFilesFromTgzStream(tgzStream._2)
  })

  // We run the same process with 3 different "modes"
  val modes = Seq("mode1", "mode2", "mode3")

  // We cache the RDD before; count() is the action that materializes the cache
  val nb = filesRDD.cache().count()
  logger.debug(s"$nb files as input")

  modes.map(mode => {
    logger.debug("Processing files with mode : " + mode)
    myProcessor.process(mode, filesRDD)
  })

  filesRDD.unpersist() // I tried with or without this

  [...]
}

The resulting logs are (for example, with 3 archives as input):

Getting files from archive : a

Getting files from archive : b

Getting files from archive : c

3 files as input

Processing files with mode : mode1

Getting files from archive : a

Getting files from archive : b

Getting files from archive : c

Processing files with mode : mode2

Getting files from archive : a

Getting files from archive : b

Getting files from archive : c

Processing files with mode : mode3

Getting files from archive : a

Getting files from archive : b

Getting files from archive : c



My Spark configuration:
  • Version: 1.6.2
  • Executors: 20 x 2 CPU x 8 GB RAM
  • YARN overhead memory per executor: 800 MB
  • Driver: 1 CPU x 8 GB RAM

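From these logs I understand that the file extraction is performed 4 times instead of 1! This obviously causes heap space problems and degraded performance...

Am I doing something wrong?

Edit: I also tried modes.foreach(...) instead of map, but nothing changed...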
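One way to check whether the partitions are actually kept after cache() is to ask Spark how many partitions of the RDD it currently stores. The following is a minimal sketch, not the asker's code: the CacheCheck object, the cachedPartitions helper, the local master and the tiny in-memory RDD are all assumptions made for illustration, and it needs spark-core on the classpath. It relies on two documented behaviours: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), and with MEMORY_ONLY any partition that does not fit in memory is not stored and is recomputed when it is needed again. MEMORY_AND_DISK spills such partitions to disk instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, not part of the original application.
object CacheCheck {

  /** Returns (cached partitions, total partitions) for the given RDD.
    * Falls back to (0, total) when Spark reports no storage info for it. */
  def cachedPartitions(sc: SparkContext, rdd: RDD[_]): (Int, Int) =
    sc.getRDDStorageInfo
      .find(_.id == rdd.id)
      .map(info => (info.numCachedPartitions, info.numPartitions))
      .getOrElse((0, rdd.partitions.length))

  def main(args: Array[String]): Unit = {
    // Local master and in-memory data stand in for the real cluster and archives.
    val conf = new SparkConf().setAppName("cache-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val filesRDD = sc.parallelize(Seq("a", "b", "c"), 3)
        .persist(StorageLevel.MEMORY_AND_DISK) // spill to disk rather than drop partitions

      filesRDD.count() // action that materializes the persisted partitions

      val (cached, total) = cachedPartitions(sc, filesRDD)
      println(s"$cached of $total partitions are currently stored")
    } finally {
      sc.stop()
    }
  }
}
```

If the reported number of stored partitions is lower than the total, later jobs on the same RDD will re-run the upstream flatMap for the missing partitions, which would produce repeated "Getting files from archive" log lines. The Storage tab of the Spark UI shows the same information.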

Best answer

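Have you tried passing your modes.map result to the List constructor (i.e. List(modes.map{ /*...*/}))? Sometimes (I'm not sure exactly when) Scala collections evaluate a map lazily, so if the mapping is not evaluated until after Spark has dropped the cache, the RDD has to be recomputed.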
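The difference between strict and lazy mapping can be checked in plain Scala without Spark. The sketch below is illustrative only: the MapEvaluationDemo object and its method names are made up for this example. It counts how many times the mapping function has run immediately after map returns. A Seq built with Seq(...) is a strict collection, so map runs the function right away; a view defers the work until something forces it, and this sketch uses toList to do the forcing.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical demo object, not part of the original application.
object MapEvaluationDemo {
  val modes = Seq("mode1", "mode2", "mode3")

  /** Number of times the function has run right after map on a strict Seq. */
  def strictCallsAfterMap(): Int = {
    val calls = ListBuffer.empty[String]
    modes.map { mode => calls += mode; mode.length }
    calls.size
  }

  /** (calls right after map on a view, calls after forcing it with toList). */
  def lazyCallsBeforeAndAfterForce(): (Int, Int) = {
    val calls = ListBuffer.empty[String]
    val pending = modes.view.map { mode => calls += mode; mode.length }
    val before = calls.size // nothing has been evaluated yet
    pending.toList          // forces the view, running the function per element
    (before, calls.size)
  }

  def main(args: Array[String]): Unit = {
    println(strictCallsAfterMap())          // prints 3
    println(lazyCallsBeforeAndAfterForce()) // prints (0,3)
  }
}
```

In the question's code, modes is created with Seq(...), so under this reading the map over it would already have run before unpersist() is called. What happens inside myProcessor.process is not shown in the question, so whether that method triggers a Spark action before returning remains an open assumption.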

Regarding scala - Spark cached RDD is computed n times, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54692846/
