apache-spark - How does Spark handle data that is much larger than Spark's storage?

Reposted. Author: 行者123. Updated: 2023-12-04 03:37:35
I am currently taking a class on Spark and came across this definition of an executor:

Each executor will hold a chunk of the data to be processed. This chunk is called a Spark partition. It is a collection of rows that sits on one physical machine in the cluster. Executors are responsible for carrying out the work assigned by the driver. Each executor is responsible for two things: (1) execute code assigned by the driver, (2) report the state of the computation back to the driver.

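I would like to know what happens if the storage of the Spark cluster is smaller than the data that needs to be processed. How do the executors fetch the data so that it resides on the physical machines in the cluster?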

The same question applies to streaming data, which is unbounded. Does Spark save all incoming data on disk?

Best answer

The Apache Spark FAQ briefly mentions the two strategies Spark may adopt:

Does my data need to fit in memory to use Spark?

No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
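To make "spilling to disk" concrete, here is a minimal sketch of the general idea in plain Python, using only the standard library. It is not Spark's implementation: the function name `external_sort` and the `max_in_memory` parameter are hypothetical, chosen for illustration. It sorts more values than the in-memory buffer may hold by writing sorted runs to temporary files and merging them lazily.

```python
import heapq
import tempfile
from typing import IO, Iterable, Iterator, List


def _write_run(run: List[int]) -> IO[str]:
    """Write one sorted run to a temporary file and rewind it for reading."""
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(f"{x}\n" for x in run)
    f.seek(0)
    return f


def _read_run(f: IO[str]) -> Iterator[int]:
    """Stream a spilled run back from disk, one value at a time."""
    for line in f:
        yield int(line)


def external_sort(values: Iterable[int], max_in_memory: int) -> Iterator[int]:
    """Sort integers while buffering at most max_in_memory of them at once.

    Hypothetical helper for illustration only; not Spark code.
    """
    runs: List[IO[str]] = []
    buffer: List[int] = []
    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            buffer.sort()
            runs.append(_write_run(buffer))  # buffer is full: spill it to disk
            buffer = []
    buffer.sort()
    try:
        # Lazily merge the spilled runs with whatever is still in memory.
        # The merge holds only one pending value per run at a time.
        yield from heapq.merge(buffer, *(_read_run(f) for f in runs))
    finally:
        for f in runs:
            f.close()
```

For example, `list(external_sort([5, 3, 9, 1, 7, 2, 8], max_in_memory=3))` spills two runs of three values each and still returns the fully sorted list. The trade-off is the same one the FAQ describes: the job completes on data of any size, at the cost of extra disk I/O.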

Although Spark uses all available memory by default, it can be configured to run jobs using only disk.

Section 2.6.4, "Behavior with Insufficient Memory", of Matei's PhD dissertation on Spark (An Architecture for Fast and General Data Processing on Large Clusters) benchmarks the performance impact of reducing the amount of available memory.

[Figure: Behavior with Insufficient Memory]

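In practice, you usually do not persist a 100 TB source DataFrame. You persist only the aggregations or intermediate computations that are reused.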

Regarding "apache-spark - How does Spark handle data that is much larger than Spark's storage?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66596227/
