gpt4 book ai didi

docker - 在 Docker 容器中运行 Spark 驱动程序 - 没有从执行程序到驱动程序的连接?

转载 作者:行者123 更新时间:2023-12-02 11:56:07 24 4
gpt4 key购买 nike

更新:问题已解决。 Docker 镜像在这里:docker-spark-submit

我在 Docker 容器内使用 fat jar 运行 Spark-submit。我的独立 Spark 集群在 3 台虚拟机上运行 - 一台主服务器和两台工作服务器。从工作计算机上的执行程序日志中,我看到执行程序具有以下驱动程序 URL:

"--driver-url" "spark://CoarseGrainedScheduler@172.17.0.2:5001"

172.17.0.2实际上是带有驱动程序的容器的地址,而不是容器运行的主机的地址。 Worker 机器无法访问此 IP,因此 Worker 无法与驱动程序进行通信。正如我从StandaloneSchedulerBackend的源代码中看到的,它使用spark.driver.host设置构建driverUrl:

val driverUrl = RpcEndpointAddress(
sc.conf.get("spark.driver.host"),
sc.conf.get("spark.driver.port").toInt,
CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString

它没有考虑 SPARK_PUBLIC_DNS 环境变量 - 这是正确的吗?在容器中,我无法将spark.driver.host 设置为除容器“内部”IP 地址(本例中为172.17.0.2)之外的任何其他值。当尝试将spark.driver.host设置为主机的IP地址时,我收到如下错误:

WARN Utils: Service 'sparkDriver' could not bind on port 5001. Attempting port 5002.

我尝试将spark.driver.bindAddress设置为主机的IP地址,但遇到了同样的错误。那么,如何配置Spark使用宿主机IP地址而不是Docker容器地址与驱动程序通信呢?

UPD:来自执行器的堆栈跟踪:

ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Failure.recover(Try.scala:216)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
... 8 more

最佳答案

所以工作配置是:

  • 将spark.driver.host设置为主机的IP地址
  • 将spark.driver.bindAddress设置为容器的IP地址

工作的 Docker 镜像在这里:docker-spark-submit .

关于docker - 在 Docker 容器中运行 Spark 驱动程序 - 没有从执行程序到驱动程序的连接?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/45489248/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com