
apache-spark - Google Dataproc - frequent disconnects from executors

Reposted. Author: 行者123. Updated: 2023-12-03 09:19:03

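I am using Dataproc to run Spark commands on a cluster through spark-shell. I frequently get error/warning messages indicating that I have lost the connection to my executors. The messages look like this: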

[Stage 6:>                                                          (0 + 2) / 2]16/01/20 10:10:24 ERROR     org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 5 on spark-cluster-femibyte-w-0.c.gcebook-1039.internal: remote Rpc client disassociated
16/01/20 10:10:24 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-cluster-femibyte-w-0.c.gcebook-1039.internal:60599] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
16/01/20 10:10:24 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.2 in stage 6.0 (TID 17, spark-cluster-femibyte-w-0.c.gcebook-1039.internal): ExecutorLostFailure (executor 5 lost)
16/01/20 10:10:24 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.2 in stage 6.0 (TID 16, spark-cluster-femibyte-w-0.c.gcebook-1039.internal): ExecutorLostFailure (executor 5 lost)

...

Here is another example:

16/01/20 10:51:43 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 2 on spark-cluster-femibyte-w-1.c.gcebook-1039.internal: remote Rpc client disassociated
16/01/20 10:51:43 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-cluster-femibyte-w-1.c.gcebook-1039.internal:58745] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
16/01/20 10:51:43 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 5, spark-cluster-femibyte-w-1.c.gcebook-1039.internal): ExecutorLostFailure (executor 2 lost)
16/01/20 10:51:43 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 (TID 4, spark-cluster-femibyte-w-1.c.gcebook-1039.internal): ExecutorLostFailure (executor 2 lost)
16/01/20 10:51:43 WARN org.apache.spark.ExecutorAllocationManager: Attempted to mark unknown executor 2 idle

Is this normal? Is there anything I can do to prevent it?

Best answer

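If the job itself isn't failing, and you aren't seeing other propagated errors tied to actual task failures (at least as far as I can tell from what is posted in the question), then you are most likely just seeing a harmless but known-to-be-spammy issue in core Spark. The key here is that Spark dynamic allocation relinquishes underused executors during a job and re-allocates them as needed. Core Spark originally failed to suppress the "lost executor" messages that this produces, but we have tested to make sure it has no ill effect on the actual job.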

Here is a googlegroups thread highlighting some of the behavioral details of Spark on YARN.

To check whether dynamic allocation is indeed what is causing the messages, try running:

spark-shell --conf spark.dynamicAllocation.enabled=false \
--conf spark.executor.instances=99999

Or, if you are submitting jobs through gcloud beta dataproc jobs, then:

gcloud beta dataproc jobs submit spark \
--properties spark.dynamicAllocation.enabled=false,spark.executor.instances=99999
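
To compare a run with dynamic allocation enabled against one with it disabled, it can help to count the "Lost executor" messages in each run's output. The following is only a rough sketch: it assumes the driver output was saved to a file, and that the YarnScheduler message format matches the lines quoted in the question. The function name count_lost_executors and the file name driver.log are hypothetical and are not part of Spark or Dataproc.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Spark or Dataproc): tally "Lost executor"
# messages per worker host in a saved driver log.
# Assumes the message format shown in the question:
#   ... YarnScheduler: Lost executor <id> on <host>: <reason>
count_lost_executors() {
  local log_file="$1"
  # Keep only the "Lost executor <id> on <host>" fragment of each matching line,
  # take the host (5th word), then count occurrences per host.
  grep -o 'Lost executor [0-9]* on [^:]*' "$log_file" \
    | awk '{ print $5 }' \
    | sort \
    | uniq -c \
    | awk '{ print $2, $1 }'
}
```

Running count_lost_executors driver.log prints one line per host, with the host name followed by the number of lost-executor messages seen for it. If the messages come from dynamic allocation, the counts should drop to zero, or close to it, once spark.dynamicAllocation.enabled=false is set.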

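If you do see network hiccups or other Dataproc errors that cause the master and workers to disassociate (when it isn't an application-side OOM or something similar), you can email the Dataproc team directly at [email protected]. Being in beta is no excuse for latent broken behavior (though of course we hope to weed out tricky edge-case bugs that we may not yet have discovered during the beta).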

Regarding apache-spark - Google Dataproc - frequent disconnects from executors, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34897150/
