
caching - Does Spark automatically cache some results?

Reposted · Author: 行者123 · Updated: 2023-12-02 10:32:12

I ran the same action twice, and the second run took far less time, so I suspect that Spark automatically caches some results. However, I could not find any source confirming this.

I am using Spark 1.4.

import re

doc = sc.textFile('...')
doc_wc = doc.flatMap(lambda x: re.split(r'\W', x)) \
            .filter(lambda x: x != '') \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda x, y: x + y)
%%time
doc_wc.take(5) # first time
# CPU times: user 10.7 ms, sys: 425 µs, total: 11.1 ms
# Wall time: 4.39 s

%%time
doc_wc.take(5) # second time
# CPU times: user 6.13 ms, sys: 276 µs, total: 6.41 ms
# Wall time: 151 ms
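As an aside, the per-record logic in the pipeline above can be checked outside Spark. A minimal pure-Python sketch of the same tokenize/filter/count steps (the sample text here is made up), which also shows why the `filter(lambda x: x != '')` step is needed: `re.split(r'\W', ...)` emits empty strings between adjacent separators:

```python
import re
from collections import Counter

lines = ["to be, or not to be"]  # made-up sample input

# flatMap + filter: split each line on non-word characters and drop the
# empty strings that re.split(r'\W', ...) produces between separators
words = [w for line in lines for w in re.split(r'\W', line) if w != '']

# map + reduceByKey: count occurrences per word
counts = Counter(words)
print(counts['to'])  # 2
```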

Best Answer

From the documentation:

Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.

The underlying file system will also cache reads from disk.

Regarding "caching - Does Spark automatically cache some results?", there is a similar question on Stack Overflow: https://stackoverflow.com/questions/31180592/
