
hadoop - Spark timeout possibly due to binaryFiles() with more than 1 million files in HDFS


I am reading millions of XML files via

val xmls = sc.binaryFiles(xmlDir)
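
For context, binaryFiles() returns an RDD of (path, PortableDataStream) pairs. A minimal sketch of how each file might then be consumed follows; the downstream step is only an assumed illustration, not part of the original job:

// Each record is (HDFS path, PortableDataStream); toArray() reads the whole file's bytes.
// Decoding to a UTF-8 string here is just an assumed example of per-file processing.
val xmlStrings = xmls.map { case (path, stream) =>
  (path, new String(stream.toArray(), "UTF-8"))
}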

This operation runs fine locally, but fails on YARN:

 client token: N/A
diagnostics: Application application_1433491939773_0012 failed 2 times due to ApplicationMaster for attempt appattempt_1433491939773_0012_000002 timed out. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1433750951883
final status: FAILED
tracking URL: http://controller01:8088/cluster/app/application_1433491939773_0012
user: ariskk
Exception in thread "main" org.apache.spark.SparkException: Application finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:622)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

In the hadoop userlogs I keep getting these messages:

15/06/08 09:15:38 WARN util.AkkaUtils: Error sending message [message = Heartbeat(1,[Lscala.Tuple2;@2b4f336b,BlockManagerId(1, controller01.stratified, 58510))] in 2 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)

I run my Spark job via spark-submit, and it works fine for another HDFS directory that contains only 37k files. Any ideas how to resolve this?

Best Answer

OK, after getting some help on the Spark mailing list, I found out there were two problems:

  1. The source directory: if it is given as /my_dir/, it makes Spark fail and produces the heartbeat issues. Instead it should be given as hdfs:///my_dir/*

  2. After fixing #1, out-of-memory errors showed up in the logs. This was the Spark driver running on YARN running out of memory because of the number of files (apparently it keeps all file metadata in memory). So I solved it by spark-submit'ing the job with --conf spark.driver.memory=8g (a combined sketch of both fixes follows below).
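
Putting the two fixes together, a minimal sketch might look like the following; hdfs:///my_dir/* and the 8g driver memory come from the points above, while the job jar name and master setting are placeholders:

// Fix #1: pass an explicit scheme plus a glob, not a bare "/my_dir/" path.
val xmlDir = "hdfs:///my_dir/*"
val xmls   = sc.binaryFiles(xmlDir)

// Fix #2 is applied at submit time, since the driver (running on YARN) holds all
// file metadata in memory, e.g. (illustrative invocation):
//   spark-submit --master yarn-cluster --conf spark.driver.memory=8g my-job.jar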

Regarding "hadoop - Spark timeout possibly due to binaryFiles() with more than 1 million files in HDFS", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30704814/
