
apache-spark - Spark on Kubernetes: Executor pods silently get killed


I am running a Spark job on Kubernetes, and with larger amounts of data I frequently get "Executor lost": executors are killed and the job fails. I have already run kubectl logs -f on all the running executor pods, but I never see any exception being thrown (I would expect something like an OutOfMemoryError). The pods just stop computing out of nowhere and are then removed immediately, so they don't even stay around in the Error state where I could dig in and troubleshoot. They simply disappear.
How should I debug this? It looks to me as if Kubernetes itself kills the pods because they exceed some limit, but as far as I know the pods should then end up in the Evicted state (or shouldn't they?).
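For reference, the way to see why Kubernetes terminated a container is to inspect the pod object rather than its logs; a minimal sketch (the pod name is a placeholder, and this only works while the pod object still exists):
kubectl get pods -w
kubectl describe pod <executor-pod-name>   # look for "Last State: Terminated", "Reason: OOMKilled", "Exit Code: 137"
kubectl get pod <executor-pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'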
It seems to be related to memory usage, because when I turn up spark.executor.memory the job runs to completion (but then with far fewer executors, resulting in much lower speed).
When running the job with local[*] as master, it also runs to completion, even with much lower memory settings.
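For completeness, a rough sketch of how such a job is submitted in the two modes; the image, jar and values are placeholders for illustration, not the real job's settings:
# on Kubernetes (client mode)
spark-submit --master k8s://https://<api-server>:6443 --deploy-mode client \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.executor.memory=4g --conf spark.executor.instances=10 \
  --class <main.Class> <application.jar>
# locally, same job
spark-submit --master 'local[*]' --driver-memory 4g --class <main.Class> <application.jar>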
Follow-up 1
I started the job with only one executor and ran kubectl logs -f on the executor pod while watching the driver's output (running in client mode). First, there are "Executor lost" messages on the driver, then the executor pod simply exits, without any exception or error message.
Follow-up 2
When the executor dies, the driver's log looks like this:

20/08/18 10:36:40 INFO TaskSchedulerImpl: Removed TaskSet 15.0, whose tasks have all completed, from pool
20/08/18 10:36:40 INFO TaskSetManager: Starting task 3.0 in stage 18.0 (TID 1554, 10.244.1.64, executor 1, partition 3, NODE_LOCAL, 7717 bytes)
20/08/18 10:36:40 INFO DAGScheduler: ShuffleMapStage 15 (parquet at DataTasks.scala:208) finished in 5.913 s
20/08/18 10:36:40 INFO DAGScheduler: looking for newly runnable stages
20/08/18 10:36:40 INFO DAGScheduler: running: Set(ShuffleMapStage 18)
20/08/18 10:36:40 INFO DAGScheduler: waiting: Set(ShuffleMapStage 20, ShuffleMapStage 21, ResultStage 22)
20/08/18 10:36:40 INFO DAGScheduler: failed: Set()
20/08/18 10:36:40 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on 10.244.1.64:43809 (size: 159.0 KiB, free: 2.2 GiB)
20/08/18 10:36:40 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 10.93.111.35:20221
20/08/18 10:36:41 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to 10.93.111.35:20221
20/08/18 10:36:49 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 1.
20/08/18 10:36:49 INFO DAGScheduler: Executor lost: 1 (epoch 12)
20/08/18 10:36:49 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/08/18 10:36:49 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, 10.244.1.64, 43809, None)
20/08/18 10:36:49 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
20/08/18 10:36:49 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 12)
On the executor, it looks like this:
20/08/18 10:36:40 INFO Executor: Running task 3.0 in stage 18.0 (TID 1554)
20/08/18 10:36:40 INFO TorrentBroadcast: Started reading broadcast variable 11 with 1 pieces (estimated total size 4.0 MiB)
20/08/18 10:36:40 INFO MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 159.0 KiB, free 2.2 GiB)
20/08/18 10:36:40 INFO TorrentBroadcast: Reading broadcast variable 11 took 7 ms
20/08/18 10:36:40 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 457.3 KiB, free 2.2 GiB)
20/08/18 10:36:40 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them
20/08/18 10:36:40 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@node01.maas:34271)
20/08/18 10:36:40 INFO MapOutputTrackerWorker: Got the output locations
20/08/18 10:36:40 INFO ShuffleBlockFetcherIterator: Getting 30 (142.3 MiB) non-empty blocks including 30 (142.3 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
20/08/18 10:36:40 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 3.082897 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 5.132359 ms
20/08/18 10:36:41 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them
20/08/18 10:36:41 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@node01.maas:34271)
20/08/18 10:36:41 INFO MapOutputTrackerWorker: Got the output locations
20/08/18 10:36:41 INFO ShuffleBlockFetcherIterator: Getting 0 (0.0 B) non-empty blocks including 0 (0.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
20/08/18 10:36:41 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 6.770762 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 3.150645 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 2.81799 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 2.989827 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 3.024777 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 4.32011 ms
Then, the executor exits.
What's also odd: stage 18.0 starts with task 3.0, not with 1.0.
Follow-up 3
I have now changed the executor log level to DEBUG, and right before the executor exits I noticed something interesting:
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@4ef2dc4a
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 64.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@4ef2dc4a
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 128.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 64.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 256.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 128.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 512.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 256.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 1024.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 512.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 2.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 1024.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 4.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 2.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 acquired 8.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 release 4.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 acquired 16.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 release 8.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 acquired 32.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 release 16.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:29 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:30 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:30 DEBUG TaskMemoryManager: Task 1155 release 32.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:34 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:34 DEBUG TaskMemoryManager: Task 1155 acquired 128.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:34 DEBUG TaskMemoryManager: Task 1155 release 64.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:36 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:36 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:37 DEBUG TaskMemoryManager: Task 1155 acquired 256.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:37 DEBUG TaskMemoryManager: Task 1155 release 128.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:37 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:38 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:38 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:39 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:39 DEBUG TaskMemoryManager: Task 1155 acquired 512.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
I gave the executor 4 GB of memory via spark.executor.memory, and these allocations add up to 1344 MB. With 4 GB of memory and the default memory split settings, 40% is 1400 MB.
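For comparison, a rough back-of-the-envelope of Spark's default on-heap layout, assuming the full 4 GB ends up as JVM heap (which need not be exactly true inside the container):
reserved memory               ≈ 300 MB
unified (execution + storage) ≈ (4096 - 300) * spark.memory.fraction (0.6) ≈ 2278 MB
user memory                   ≈ (4096 - 300) * 0.4 ≈ 1518 MB
The UnsafeExternalSorter acquisitions above go through the TaskMemoryManager, i.e. they come out of the unified execution region, so 1344 MB by itself would still fit into that budget.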
Can I somehow limit the amount of memory that UnsafeExternalSorter takes?
Follow-up 4
I stumbled upon a rare case in which, for some reason, Spark did not kill the "finished" executor, and I saw that the pod was OOMKilled. It looks like spark.executor.memory sets both the pod's requested memory and the memory configuration inside the Spark executor.
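If that is the case, the pod's memory request/limit should roughly be derived as follows (defaults in parentheses; assuming no off-heap or PySpark memory is configured):
memoryOverhead     = max(spark.kubernetes.memoryOverheadFactor (0.1) * spark.executor.memory, 384 MiB)
pod memory request ≈ spark.executor.memory + memoryOverhead
e.g. with spark.executor.memory=4g: 4096 MiB + max(0.1 * 4096, 384) MiB ≈ 4506 MiB
Everything the executor JVM uses outside the heap (Netty buffers, memory-mapped shuffle files, metaspace, ...) has to fit into the overhead part, otherwise the container exceeds its limit and gets OOMKilled.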

Best Answer

Follow-up 4 turned out to be the answer. I ran the job again with kubectl get pod -w and saw the executor pods getting OOMKilled. I am now running with spark.kubernetes.memoryOverheadFactor=0.5 and spark.memory.fraction=0.2, tuning spark.executor.memory so high that barely one executor fits per node, and I set spark.executor.cores to the number of cores per node minus 1. That way, it runs.
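As a sketch, the submit command now looks roughly like this; the image, jar, memory value and core count are placeholders for illustration (spark.executor.memory sized to almost a whole node, spark.executor.cores set to the node's core count minus one):
spark-submit --master k8s://https://<api-server>:6443 --deploy-mode client \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.kubernetes.memoryOverheadFactor=0.5 \
  --conf spark.memory.fraction=0.2 \
  --conf spark.executor.memory=24g \
  --conf spark.executor.cores=7 \
  --class <main.Class> <application.jar>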
I also reworked my algorithm, because it had a large partition skew and had to do some computations that were not easily parallelizable, which caused a lot of shuffling.

Regarding apache-spark - Spark on Kubernetes: Executor pods silently get killed, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63466193/
