I am running a Spark job on Kubernetes, and with larger amounts of data I frequently get "Executor lost": the executors are killed and the job fails. I ran kubectl logs -f on all running executor pods, but I never saw any exception thrown (I would expect something like an OutOfMemoryError). The pods just suddenly stop computing and are then removed directly, so they do not even remain in the Error state where I could dig in and troubleshoot. They simply disappear.
How should I debug this? It looks to me like Kubernetes itself kills the pods because they exceed some boundary, but as far as I understand, the pods should then end up in the Evicted state (or should they not?).
It seems to be related to memory usage, because when I turn up spark.executor.memory my job runs to completion (but then with far fewer executors, which makes it much slower).
When running the job with local[*] as the master, it runs to completion even with a much lower memory setting.
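Since the pods disappear before they can be inspected, one way to catch the termination reason is to watch pod state transitions and cluster events while the job runs. A command sketch (pod name and namespace are placeholders, not from the original job):

```shell
# Sketch: standard kubectl commands for catching why an executor pod died.
# <executor-pod> is a placeholder for the actual pod name.

# Watch pod phase transitions live; an OOM-killed container briefly shows
# OOMKilled in its status before the pod object is cleaned up.
kubectl get pods -w

# Show the container's last termination state and reason
# (e.g. OOMKilled, exit code 137) while the pod object still exists.
kubectl describe pod <executor-pod>

# Cluster events often outlive the pod object itself.
kubectl get events --sort-by=.metadata.creationTimestamp
```

Exit code 137 (128 + SIGKILL) in the last state is the typical signature of the kernel OOM killer rather than a JVM-level OutOfMemoryError.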
Follow-up 1
I started the job with only one executor, ran kubectl logs -f on the executor pod, and watched the driver's output (running in client mode). First, the "Executor lost" message appeared on the driver, then the executor pod simply exited without any exception or error message.
Follow-up 2
When the executor dies, the driver's log looks like this:
20/08/18 10:36:40 INFO TaskSchedulerImpl: Removed TaskSet 15.0, whose tasks have all completed, from pool
20/08/18 10:36:40 INFO TaskSetManager: Starting task 3.0 in stage 18.0 (TID 1554, 10.244.1.64, executor 1, partition 3, NODE_LOCAL, 7717 bytes)
20/08/18 10:36:40 INFO DAGScheduler: ShuffleMapStage 15 (parquet at DataTasks.scala:208) finished in 5.913 s
20/08/18 10:36:40 INFO DAGScheduler: looking for newly runnable stages
20/08/18 10:36:40 INFO DAGScheduler: running: Set(ShuffleMapStage 18)
20/08/18 10:36:40 INFO DAGScheduler: waiting: Set(ShuffleMapStage 20, ShuffleMapStage 21, ResultStage 22)
20/08/18 10:36:40 INFO DAGScheduler: failed: Set()
20/08/18 10:36:40 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on 10.244.1.64:43809 (size: 159.0 KiB, free: 2.2 GiB)
20/08/18 10:36:40 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 10.93.111.35:20221
20/08/18 10:36:41 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to 10.93.111.35:20221
20/08/18 10:36:49 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 1.
20/08/18 10:36:49 INFO DAGScheduler: Executor lost: 1 (epoch 12)
20/08/18 10:36:49 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/08/18 10:36:49 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, 10.244.1.64, 43809, None)
20/08/18 10:36:49 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
20/08/18 10:36:49 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 12)
On the executor, it looks like this:
20/08/18 10:36:40 INFO Executor: Running task 3.0 in stage 18.0 (TID 1554)
20/08/18 10:36:40 INFO TorrentBroadcast: Started reading broadcast variable 11 with 1 pieces (estimated total size 4.0 MiB)
20/08/18 10:36:40 INFO MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 159.0 KiB, free 2.2 GiB)
20/08/18 10:36:40 INFO TorrentBroadcast: Reading broadcast variable 11 took 7 ms
20/08/18 10:36:40 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 457.3 KiB, free 2.2 GiB)
20/08/18 10:36:40 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them
20/08/18 10:36:40 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@node01.maas:34271)
20/08/18 10:36:40 INFO MapOutputTrackerWorker: Got the output locations
20/08/18 10:36:40 INFO ShuffleBlockFetcherIterator: Getting 30 (142.3 MiB) non-empty blocks including 30 (142.3 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
20/08/18 10:36:40 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 3.082897 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 5.132359 ms
20/08/18 10:36:41 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them
20/08/18 10:36:41 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@node01.maas:34271)
20/08/18 10:36:41 INFO MapOutputTrackerWorker: Got the output locations
20/08/18 10:36:41 INFO ShuffleBlockFetcherIterator: Getting 0 (0.0 B) non-empty blocks including 0 (0.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
20/08/18 10:36:41 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 6.770762 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 3.150645 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 2.81799 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 2.989827 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 3.024777 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 4.32011 ms
Then, the executor exits.
DEBUG
Just before the executor exited, I noticed something interesting:
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@4ef2dc4a
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 64.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@4ef2dc4a
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 128.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 64.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 256.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 128.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 512.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 256.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 1024.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 512.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 2.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 1024.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 4.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 2.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 acquired 8.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 release 4.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 acquired 16.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 release 8.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 acquired 32.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 release 16.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:29 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:30 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:30 DEBUG TaskMemoryManager: Task 1155 release 32.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:34 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:34 DEBUG TaskMemoryManager: Task 1155 acquired 128.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:34 DEBUG TaskMemoryManager: Task 1155 release 64.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:36 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:36 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:37 DEBUG TaskMemoryManager: Task 1155 acquired 256.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:37 DEBUG TaskMemoryManager: Task 1155 release 128.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:37 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:38 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:38 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:39 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:39 DEBUG TaskMemoryManager: Task 1155 acquired 512.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
I gave the executors 4 GB of memory via spark.executor.memory, and these allocations add up to 1344 MB. With 4 GB of memory and the default memory-split settings, 40% is 1400 MB. Could that be the amount of memory the UnsafeExternalSorter needs?
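For comparison, a back-of-the-envelope sketch of the pool those sorter allocations draw from, under Spark 3.x defaults (assumptions: 300 MiB reserved memory, spark.memory.fraction = 0.6; integer MiB arithmetic for simplicity):

```shell
# Rough sketch of Spark's unified (execution + storage) pool for a 4 GiB heap.
# Assumes Spark 3.x defaults: 300 MiB reserved, spark.memory.fraction = 0.6.
HEAP_MB=4096          # spark.executor.memory=4g
RESERVED_MB=300       # fixed reserved memory
FRACTION_PCT=60       # spark.memory.fraction = 0.6
UNIFIED_MB=$(( (HEAP_MB - RESERVED_MB) * FRACTION_PCT / 100 ))
echo "unified execution+storage pool: ${UNIFIED_MB} MiB"
```

So on-heap execution memory alone would not explain the pod dying at ~1344 MB of sorter allocations; the allocations fit the pool, which points at a limit outside the JVM heap.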
Follow-up 4
The executor pods are getting OOMKilled. It seems that spark.executor.memory sets both the pod's memory request and the memory configuration inside the Spark executor.
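The pod-level math behind that observation can be sketched as follows (assumptions: Spark-on-Kubernetes defaults of spark.kubernetes.memoryOverheadFactor = 0.1 for JVM jobs with a 384 MiB minimum overhead; integer MiB arithmetic):

```shell
# Sketch: how the executor pod's memory request is derived from
# spark.executor.memory under default settings.
EXECUTOR_MEM_MB=4096       # spark.executor.memory=4g
OVERHEAD_PCT=10            # spark.kubernetes.memoryOverheadFactor = 0.1
OVERHEAD_MB=$(( EXECUTOR_MEM_MB * OVERHEAD_PCT / 100 ))
if [ "$OVERHEAD_MB" -lt 384 ]; then OVERHEAD_MB=384; fi   # 384 MiB floor
POD_REQUEST_MB=$(( EXECUTOR_MEM_MB + OVERHEAD_MB ))
echo "pod memory request: ${POD_REQUEST_MB} MiB"
```

All off-heap usage (JVM metaspace, thread stacks, netty shuffle buffers) has to fit into that overhead slice; if the container's total footprint exceeds the pod limit, the kernel OOM killer terminates it with no JVM exception, which matches the silent deaths above.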
Best Answer
Follow-up 4 is the answer. I ran the job again with kubectl get pod -w and saw the executor pods getting OOMKilled. I am now running with spark.kubernetes.memoryOverheadFactor=0.5 and spark.memory.fraction=0.2, turning spark.executor.memory up so high that barely one executor starts per node, and setting spark.executor.cores to each node's core count minus 1. With that, the job runs.
I also tuned my algorithm, because it had a large partition skew and had to do some computations that are not easily parallelizable, which caused a lot of shuffling.
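The settings described in the answer could be combined into a submission command roughly like this (a sketch only: the API server address, container image, jar name, and the concrete 16g/15-core sizing are illustrative placeholders, not values from the original job):

```shell
# Sketch of a spark-submit matching the answer's configuration.
# <api-server>, <your-spark-image>, my-job.jar, and the sizes are placeholders.
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode client \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.kubernetes.memoryOverheadFactor=0.5 \
  --conf spark.memory.fraction=0.2 \
  --conf spark.executor.memory=16g \
  --conf spark.executor.cores=15 \
  my-job.jar
```

The intent of the combination: a large overhead factor leaves room for off-heap buffers inside the pod limit, a small memory fraction caps the unified pool, and big executors with cores-minus-one keep roughly one executor per node.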
Regarding "apache-spark - Spark on Kubernetes: Executor pods silently get killed", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/63466193/