
java - Does cache() in Spark change the state of the RDD or create a new one?

Reposted. Author: 搜寻专家. Updated: 2023-11-01 01:50:22

This question is a follow-up to my previous question, "What happens if I cache the same RDD twice in Spark".

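When cache() is called on an RDD, does the state of the RDD change (with the returned RDD being just this, for ease of use), or is a new RDD created that wraps the existing one?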

What happens in the following code:

// Init
JavaRDD<String> a = ... // some initialise and calculation functions.
JavaRDD<String> b = a.cache();
JavaRDD<String> c = b.cache();

// Case 1, will 'a' be calculated twice in this case
// because it's before the cache layer:
a.saveAsTextFile(somePath);
a.saveAsTextFile(somePath);

// Case 2, will the data of the calculation of 'a'
// be cached in the memory twice in this case
// (once as 'b' and once as 'c'):
c.saveAsTextFile(somePath);

Best answer

"When calling cache() on an RDD, does the state of the RDD change (and the returned RDD is just this for ease of use), or is a new RDD created that wraps the existing one?"

The same RDD is returned:

/**
 * Mark this RDD for persisting using the specified level.
 *
 * @param newLevel the target storage level
 * @param allowOverride whether to override any existing level with the new one
 */
private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of an RDD after it was already assigned a level")
  }
  // If this is the first time this RDD is marked for persisting, register it
  // with the SparkContext for cleanups and accounting. Do this only once.
  if (storageLevel == StorageLevel.NONE) {
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    sc.persistRDD(this)
  }
  storageLevel = newLevel
  this
}

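Caching neither creates a new RDD nor wraps the existing one. If the RDD is already marked for persistence at the same level, nothing changes. If it is not, the RDD's storageLevel field is set and the RDD is registered with the SparkContext; that registration is a side effect on the context rather than on the RDD itself.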
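To make the behaviour of the persist logic quoted above easier to see in isolation, here is a minimal plain-Java sketch. It does not use Spark: ToyRdd, ToyContext, and Level are hypothetical stand-ins invented for illustration, and allowOverride is treated as always false. It only models the three points the code makes: the same object is returned, registration happens once, and changing an assigned level is rejected.

```java
import java.util.ArrayList;
import java.util.List;

public class PersistSketch {

    // Hypothetical stand-in for Spark's StorageLevel; only a few levels are modelled.
    enum Level { NONE, MEMORY_ONLY, DISK_ONLY }

    // Hypothetical stand-in for the SparkContext's record of persisted RDDs.
    static class ToyContext {
        final List<ToyRdd> persisted = new ArrayList<>();
    }

    // Hypothetical stand-in for RDD; mirrors only the persist logic, not real Spark.
    static class ToyRdd {
        private final ToyContext context;
        private Level storageLevel = Level.NONE;

        ToyRdd(ToyContext context) {
            this.context = context;
        }

        ToyRdd persist(Level newLevel) {
            // Reject a change of level once one has been assigned.
            if (storageLevel != Level.NONE && newLevel != storageLevel) {
                throw new UnsupportedOperationException(
                    "Cannot change storage level of an RDD after it was already assigned a level");
            }
            // Register with the context only the first time the RDD is marked.
            if (storageLevel == Level.NONE) {
                context.persisted.add(this);
            }
            storageLevel = newLevel;
            return this; // no new object is created
        }

        ToyRdd cache() {
            return persist(Level.MEMORY_ONLY);
        }

        Level getStorageLevel() {
            return storageLevel;
        }
    }

    public static void main(String[] args) {
        ToyContext context = new ToyContext();
        ToyRdd a = new ToyRdd(context);
        ToyRdd b = a.cache();
        ToyRdd c = b.cache();

        System.out.println(a == b && b == c);          // prints "true"
        System.out.println(context.persisted.size());  // prints "1"
    }
}
```

In this model the second cache() call is a no-op: the level is already MEMORY_ONLY, so nothing is registered again and the same reference comes back.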

Edit:

Looking at JavaRDD.cache, it seems the underlying call does allocate another JavaRDD:

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): JavaRDD[T] = wrapRDD(rdd.cache())

where wrapRDD calls JavaRDD.fromRDD:

object JavaRDD {

  implicit def fromRDD[T: ClassTag](rdd: RDD[T]): JavaRDD[T] = new JavaRDD[T](rdd)
  implicit def toRDD[T](rdd: JavaRDD[T]): RDD[T] = rdd.rdd
}

This results in a new JavaRDD being allocated. That said, the inner RDD[T] instance stays the same.
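The distinction between the wrapper and the wrapped instance can be sketched in plain Java as follows. This is a toy model, not Spark code: InnerRdd and Wrapper are hypothetical names standing in for RDD[T] and JavaRDD[T], and the only behaviour modelled is that cache() on the wrapper marks the inner object and then re-wraps it.

```java
public class WrapperSketch {

    // Hypothetical stand-in for the Scala RDD[T]: cache() marks and returns the same object.
    static class InnerRdd {
        private boolean cached = false;

        InnerRdd cache() {
            cached = true;
            return this;
        }

        boolean isCached() {
            return cached;
        }
    }

    // Hypothetical stand-in for JavaRDD[T]: a thin wrapper that re-wraps on cache().
    static class Wrapper {
        private final InnerRdd rdd;

        Wrapper(InnerRdd rdd) {
            this.rdd = rdd;
        }

        InnerRdd rdd() {
            return rdd;
        }

        Wrapper cache() {
            // A new wrapper is allocated, but it holds the same inner object.
            return new Wrapper(rdd.cache());
        }
    }

    public static void main(String[] args) {
        Wrapper a = new Wrapper(new InnerRdd());
        Wrapper b = a.cache();
        Wrapper c = b.cache();

        System.out.println(a == b);              // prints "false": distinct wrappers
        System.out.println(a.rdd() == c.rdd());  // prints "true": one shared inner object
        System.out.println(a.rdd().isCached());  // prints "true": 'a' sees the mark too
    }
}
```

If the real classes behave like this model, then a, b, and c in the question all refer to one underlying RDD. That would mean the mark set through b is also visible through a (Case 1), and the data is stored once rather than once per wrapper (Case 2).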

Regarding "java - Does cache() in Spark change the state of the RDD or create a new one?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36196522/
