
hadoop - Spark won't run the final `saveAsNewAPIHadoopFile` step in yarn-cluster mode

Reposted · Author: 可可西里 · Updated: 2023-11-01 15:26:24

I wrote a Spark application that reads some CSV files (~5-10 GB), transforms the data, and converts it into HFiles. The data is read from HDFS and saved back to HDFS.

Everything seems to work fine when I run the application in yarn-client mode.

But when I try to run the application in yarn-cluster mode, the process never seems to run the final saveAsNewAPIHadoopFile action on the RDD that has been transformed and is ready to be saved!

Here is a snapshot of my Spark UI, where you can see that all other jobs have been processed:

[Spark UI screenshot: completed jobs]

And the corresponding stages:

[Spark UI screenshot: stages]

This is the final step of my application, where the saveAsNewAPIHadoopFile method is called:

JavaPairRDD<ImmutableBytesWritable, KeyValue> cells = ...

try {
    Connection c = HBaseKerberos.createHBaseConnectionKerberized("userpricipal", "/etc/security/keytabs/user.keytab");
    Configuration baseConf = c.getConfiguration();
    baseConf.set("hbase.zookeeper.quorum", HBASE_HOST);
    baseConf.set("zookeeper.znode.parent", "/hbase-secure");

    Job job = Job.getInstance(baseConf, "Test Bulk Load");
    HTable table = new HTable(baseConf, "map_data");

    HBaseAdmin admin = new HBaseAdmin(baseConf);
    HFileOutputFormat2.configureIncrementalLoad(job, table);
    Configuration conf = job.getConfiguration();

    cells.saveAsNewAPIHadoopFile(outputPath, ImmutableBytesWritable.class, KeyValue.class, HFileOutputFormat2.class, conf);
    System.out.println("Finished!!!!!");
} catch (IOException e) {
    e.printStackTrace();
    System.out.println(e.getMessage());
}

I'm running the application via spark-submit --master yarn --deploy-mode cluster --class sparkhbase.BulkLoadAsKeyValue3 --driver-cores 8 --driver-memory 11g --executor-cores 4 --executor-memory 9g /home/myuser/app.jar

When I look at the output directory on HDFS, it is still empty! I'm using Spark 1.6.3 on an HDP 2.5 platform.
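To confirm whether the save step produced anything at all, the output directory can be checked for entries after the job finishes. The sketch below is a minimal, self-contained illustration that uses java.nio.file as a local stand-in; on the real cluster the equivalent check would go through org.apache.hadoop.fs.FileSystem#listStatus against the HDFS outputPath (the class and helper names here are hypothetical, not from the question).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class OutputCheck {
    // Returns true if the directory exists and contains at least one entry.
    // On HDFS you would instead call FileSystem#listStatus(new Path(outputPath))
    // and check that the returned array is non-empty.
    static boolean hasOutput(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.findAny().isPresent();
        }
    }

    public static void main(String[] args) throws IOException {
        Path empty = Files.createTempDirectory("hfiles-empty");
        System.out.println(hasOutput(empty));   // empty directory -> false

        Path full = Files.createTempDirectory("hfiles-full");
        Files.createFile(full.resolve("part-r-00000"));
        System.out.println(hasOutput(full));    // contains a file -> true
    }
}
```

If this check fails right after the job reports success, the write never happened, which points at the action not being scheduled rather than at a slow write.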

So I have two questions here: Where does this behavior come from (maybe a memory problem)? And what is the difference between yarn-client and yarn-cluster mode (I don't understand it yet, and the documentation isn't clear to me either)? Thanks for your help!

Best Answer

It seems the job never starts. Check the available resources before Spark launches the job; I suspect they are insufficient. So try reducing the driver and executor memory, and the driver and executor cores, in your configuration. You can read how to calculate appropriate resource values for executors and the driver here: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
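The sizing heuristic from that Cloudera post can be sketched as a small calculation. The node dimensions below (16 cores, 64 GB) are hypothetical example values, not taken from the question:

```java
public class ExecutorSizing {
    // Executors that fit on one node after reserving 1 core for the OS
    // and Hadoop daemons.
    static int executorsPerNode(int nodeCores, int coresPerExecutor) {
        return (nodeCores - 1) / coresPerExecutor;
    }

    // Per-executor heap after reserving 1 GB for the node and ~7% for
    // YARN's off-heap overhead (spark.yarn.executor.memoryOverhead).
    static int executorMemoryGb(int nodeMemGb, int executorsPerNode) {
        int raw = (nodeMemGb - 1) / executorsPerNode;
        return (int) Math.floor(raw * 0.93);
    }

    public static void main(String[] args) {
        // Hypothetical node: 16 cores, 64 GB RAM; ~5 cores per executor
        // is the post's rule of thumb for good HDFS throughput.
        int execs = executorsPerNode(16, 5);       // (16 - 1) / 5 = 3
        int memGb = executorMemoryGb(64, execs);   // floor((63 / 3) * 0.93) = 19
        System.out.println("--executor-cores 5 --executor-memory " + memGb
                + "g (" + execs + " executors per node)");
    }
}
```

The point of the calculation is that requesting more memory or cores than YARN can actually grant (as the --driver-memory 11g --executor-memory 9g flags in the question may do) leaves containers pending instead of running.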

Your job runs in client mode because there the driver can use all available resources on the node it was launched from. In cluster mode, the resources are limited.

The difference between cluster mode and client mode:
Client:

Driver runs on a dedicated server (Master node) inside a dedicated process. This means it has all available resources at its disposal to execute work.
Driver opens up a dedicated Netty HTTP server and distributes the JAR files specified to all Worker nodes (big advantage).
Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources for the Driver program.
If the driver process dies, you need an external monitoring system to restart it.

Cluster:

Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.
Driver runs as a dedicated, standalone process inside the Worker.
The Driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
The Driver program can be monitored from the Master node using the --supervise flag and restarted in case it dies.
When working in Cluster mode, all JARs related to the execution of your application need to be available to all the workers. This means you can either manually place them in a shared location or in a folder on each of the workers.

Regarding hadoop - Spark won't run the final `saveAsNewAPIHadoopFile` step in yarn-cluster mode, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46240946/
