
hadoop - Apache Spark: The number of cores vs. the number of executors


I am trying to understand the relationship between the number of cores and the number of executors when running a Spark job on YARN.

The test environment is as follows:

  • Number of data nodes: 3
  • Data node machine spec:
    • CPU: Core i7-4790 (# of cores: 4, # of threads: 8)
    • RAM: 32 GB (8 GB x 4)
    • HDD: 8 TB (2 TB x 4)
  • Network: 1 Gb

  • Spark version: 1.0.0

  • Hadoop version: 2.4.0 (Hortonworks HDP 2.1)

  • Spark job flow: sc.textFile -> filter -> map -> filter -> mapToPair -> reduceByKey -> map -> saveAsTextFile (a code sketch of this pipeline follows this list)

  • Input data

    • Type: single text file
    • Size: 165 GB
    • Number of lines: 454,568,833
  • Output

    • Number of lines after the second filter: 310,640,717
    • Number of lines in the result file: 99,848,268
    • Size of the result file: 41 GB
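
For concreteness, here is a minimal Scala sketch of that pipeline. The question does not give the actual predicates, key extraction, or paths, so the ones below (non-empty-line filter, keyword filter, tab-delimited key, HDFS paths) are hypothetical placeholders; the original job uses the Java API's mapToPair, which corresponds to mapping to (key, value) tuples in Scala.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoresVsExecutorsJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cores-vs-executors"))

    sc.textFile("hdfs:///input/data.txt")            // single 165 GB text file (path hypothetical)
      .filter(_.nonEmpty)                            // first filter (predicate hypothetical)
      .map(_.toLowerCase)                            // map
      .filter(_.contains("keyword"))                 // second filter (predicate hypothetical)
      .map(line => (line.split("\t")(0), 1L))        // mapToPair equivalent: (key, value) tuples
      .reduceByKey(_ + _)                            // shuffle boundary
      .map { case (k, v) => s"$k\t$v" }              // format result lines
      .saveAsTextFile("hdfs:///output/result")       // ~41 GB of output (path hypothetical)

    sc.stop()
  }
}
```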

The job was run with the following configurations:

  1. --master yarn-client --executor-memory 19G --executor-cores 7 --num-executors 3 (one executor per data node, using as many cores as possible)

  2. --master yarn-client --executor-memory 19G --executor-cores 4 --num-executors 3 (number of cores reduced)

  3. --master yarn-client --executor-memory 4G --executor-cores 2 --num-executors 12 (fewer cores, more executors)


Elapsed times:

  1. 50 min 15 sec

  2. 55 min 48 sec

  3. 31 min 23 sec

To my surprise, (3) was faster.
I thought (1) would be faster, since there would be less inter-executor communication when shuffling.
Although (1) has fewer cores than (3), the number of cores is not the key factor, since (2) did perform well.
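
One way to make the comparison concrete is to total up the resources each run requests; this is plain arithmetic over the spark-submit flags above, not measured data.

```scala
// (num-executors, executor-cores, executor-memory in GB) for each run.
val configs = Seq(("(1)", 3, 7, 19), ("(2)", 3, 4, 19), ("(3)", 12, 2, 4))

for ((name, executors, cores, memGB) <- configs) {
  val totalCores = executors * cores   // maximum number of concurrent tasks
  val totalMemGB = executors * memGB   // total executor heap requested
  println(s"$name: $totalCores cores, $totalMemGB GB heap, $executors executors")
}
// (1): 21 cores, 57 GB heap, 3 executors
// (2): 12 cores, 57 GB heap, 3 executors
// (3): 24 cores, 48 GB heap, 12 executors
```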

(The following was added after pwilmot's answer.)

For reference, the performance-monitor screenshots are as follows:

  • Ganglia data node summary for (1) - the job started at 04:37.

[Image: Ganglia data node summary for (1)]

  • Ganglia data node summary for (3) - the job started at 19:47. Please ignore the chart before that time.

[Image: Ganglia data node summary for (3)]

The chart roughly divides into two sections:

  • First: from start to reduceByKey: CPU intensive, no network activity
  • Second: after reduceByKey: CPU drops, network I/O is done.

As the chart shows, (1) can use as much CPU power as it was given, so it may not be a problem of the number of threads.

How can this result be explained?

Best Answer

To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100% of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes. Cloudera Manager helps by accounting for these and configuring these YARN properties automatically.

The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:

  • 63GB + the executor memory overhead won’t fit within the 63GB capacity of the NodeManagers.
  • The application master will take up a core on one of the nodes, meaning that there won’t be room for a 15-core executor on that node.
  • 15 cores per executor can lead to bad HDFS I/O throughput.

A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?

  • This config results in three executors on all nodes except for the one with the AM, which will have two executors.
  • --executor-memory was derived as (63/3 executors per node) = 21. 21 * 0.07 = 1.47. 21 – 1.47 ~ 19.
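
A rough reconstruction of that arithmetic in Scala, assuming the 7% figure stands in for the YARN executor memory overhead (spark.yarn.executor.memoryOverhead) described in the blog post:

```scala
// 6 nodes with 16 cores and 64 GB each; leave 1 core and 1 GB per node for the OS and Hadoop daemons.
val usableCoresPerNode = 16 - 1                                     // 15
val usableMemPerNodeGB = 64 - 1                                     // 63

val coresPerExecutor = 5                                            // keeps HDFS I/O throughput healthy
val executorsPerNode = usableCoresPerNode / coresPerExecutor        // 15 / 5 = 3
val numExecutors     = 6 * executorsPerNode - 1                     // 18 - 1 = 17 (one slot left for the AM)

val rawMemPerExecutorGB = usableMemPerNodeGB / executorsPerNode     // 63 / 3 = 21
val overheadGB          = 0.07 * rawMemPerExecutorGB                // 21 * 0.07 = 1.47 (assumed 7% overhead)
val executorMemoryGB    = (rawMemPerExecutorGB - overheadGB).toInt  // ~19

println(s"--num-executors $numExecutors --executor-cores $coresPerExecutor --executor-memory ${executorMemoryGB}G")
// --num-executors 17 --executor-cores 5 --executor-memory 19G
```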

The explanation comes from a post on the Cloudera blog, How-to: Tune Your Apache Spark Jobs (Part 2).

Regarding hadoop - Apache Spark: The number of cores vs. the number of executors, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24622108/
