
apache-spark - Difference between "spark.yarn.executor.memoryOverhead" and "spark.memory.offHeap.size"


I am running Spark on YARN. I don't understand the difference between the following settings: spark.yarn.executor.memoryOverhead and spark.memory.offHeap.size. Both appear to be settings for allocating off-heap memory to Spark executors. Which one should I use? Also, what is the recommended setting for executor off-heap memory?

Thanks a lot!

Best Answer

spark.executor.memoryOverhead is used by resource managers such as YARN, whereas spark.memory.offHeap.size is used by Spark core (the memory manager). The relationship between the two differs by Spark version.

Spark 2.4.5 and earlier:
spark.executor.memoryOverhead should include spark.memory.offHeap.size. This means that if you specify offHeap.size, you need to manually add this portion to memoryOverhead for YARN. As you can see from the following code in YarnAllocator.scala, YARN knows nothing about offHeap.size when requesting resources:

private[yarn] val resource = Resource.newInstance(
  executorMemory + memoryOverhead + pysparkWorkerMemory,
  executorCores)
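
For illustration, here is a minimal configuration sketch for Spark 2.4.5 or earlier. The memory sizes are assumptions for the example, not recommendations: since YARN only sees memoryOverhead, the 2g of off-heap memory is folded into it by hand.

import org.apache.spark.sql.SparkSession

// Hypothetical sizing: 8g heap, 2g off-heap.
// The default overhead would be max(0.10 * 8g, 384m), roughly 819m; on
// Spark <= 2.4.5 the off-heap portion must be added to it manually so
// the YARN container is large enough.
val spark = SparkSession.builder()
  .appName("offheap-sizing-sketch")
  .config("spark.executor.memory", "8g")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  .config("spark.executor.memoryOverhead", "3g") // ~819m default + 2g off-heap
  .getOrCreate()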

However, this behavior changed in Spark 3.0:
spark.executor.memoryOverhead no longer includes spark.memory.offHeap.size. YARN will include offHeap.size for you when requesting resources. From the new documentation:

Note: Additional memory includes PySpark executor memory (when spark.executor.pyspark.memory is not configured) and memory used by other non-executor processes running in the same container. The maximum memory size of container to running executor is determined by the sum of spark.executor.memoryOverhead, spark.executor.memory, spark.memory.offHeap.size and spark.executor.pyspark.memory.



You can also tell from the code:
private[yarn] val resource: Resource = {
  val resource = Resource.newInstance(
    executorMemory + executorOffHeapMemory + memoryOverhead + pysparkWorkerMemory,
    executorCores)
  ResourceRequestHelper.setResourceRequests(executorResourceRequests, resource)
  logDebug(s"Created resource capability: $resource")
  resource
}
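
To make the difference concrete, here is a small back-of-the-envelope sketch of what the container request sums to under the Spark 3.0 rule. All sizes are assumed for illustration:

// Illustrative arithmetic only; every value here is an assumption.
val executorMemoryMiB = 8L * 1024 // spark.executor.memory = 8g
val offHeapMiB        = 2L * 1024 // spark.memory.offHeap.size = 2g
val overheadMiB       = 819L      // default: max(0.10 * executorMemory, 384 MiB)
val pysparkMiB        = 0L        // spark.executor.pyspark.memory not set

// Spark 3.0+: Spark itself adds the off-heap portion when sizing the container,
// so memoryOverhead no longer needs to account for it.
val containerMiB = executorMemoryMiB + offHeapMiB + overheadMiB + pysparkMiB
println(s"YARN container request: $containerMiB MiB") // 11059 MiB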

For more details on this change, you can refer to this Pull Request.

As for your second question, the recommended setting for executor off-heap memory: it depends on your application, and you need to do some testing. I found this page helpful in explaining it further:

Off-heap memory is a great way to reduce GC pauses because it's not in the GC's scope. However, it brings an overhead of serialization and deserialization. The latter in its turn makes that the off-heap data can be sometimes put onto heap memory and hence be exposed to GC. Also, the new data format brought by Project Tungsten (array of bytes) helps to reduce the GC overhead. These 2 reasons make that the use of off-heap memory in Apache Spark applications should be carefully planned and, especially, tested.



By the way, spark.yarn.executor.memoryOverhead is deprecated and has been renamed to spark.executor.memoryOverhead, which is shared by YARN and Kubernetes.
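
As a minimal sketch of that renaming (the overhead value is an assumption for the example), the deprecated YARN-specific key is simply replaced by the resource-manager-agnostic one:

// Deprecated (YARN-only spelling):
//   --conf spark.yarn.executor.memoryOverhead=3g
// Current (works on both YARN and Kubernetes):
//   --conf spark.executor.memoryOverhead=3g
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memoryOverhead", "3g")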

Regarding apache-spark - the difference between "spark.yarn.executor.memoryOverhead" and "spark.memory.offHeap.size", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58666517/
