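Twice now, the server has gone down just as I finished training a model on 4x 1080 Ti. Why did the server crash?
I pulled the system log and found messages pointing to a problem with the Nvidia driver or the GPUs.
System log (and nvidia-bug-report.log):
[Second crash]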
Sep 6 21:11:41 gpu-8-server-intesight kernel: [31429.221258] NVRM: RmInitAdapter failed! (0x30:0xffff:682)
Sep 6 21:11:41 gpu-8-server-intesight kernel: [31429.221337] NVRM: rm_init_adapter failed for device bearing minor number 0
Sep 6 21:13:54 gpu-8-server-intesight kernel: [31562.154256] NVRM: RmInitAdapter failed! (0x30:0xffff:682)
Sep 6 21:13:54 gpu-8-server-intesight kernel: [31562.154306] NVRM: rm_init_adapter failed for device bearing minor number 1
[First crash]
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990374] NVRM: GPU at PCI:0000:04:00: GPU-bc54db68-a3cb-54e9-7287-b95c69e41cf1
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990375] NVRM: GPU Board Serial Number:
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990376] NVRM: Xid (PCI:0000:04:00): 79, GPU has fallen off the bus.
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990377] NVRM: GPU at 0000:04:00.0 has fallen off the bus.
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990377] NVRM: GPU is on Board .
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990655] NVRM: A GPU crash dump has been created. If possible, please run
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990655] NVRM: nvidia-bug-report.sh as root to collect this data before
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990655] NVRM: the NVIDIA kernel module is unloaded.
Sep 6 02:48:41 gpu-8-server-intesight kernel: [557999.884383] NVRM: GPU at 0000:04:00.0 has fallen off the bus.
Sep 6 02:48:41 gpu-8-server-intesight kernel: [557999.901942] NVRM: A GPU crash dump has been created. If possible, please run
Sep 6 02:48:41 gpu-8-server-intesight kernel: [557999.901942] NVRM: nvidia-bug-report.sh as root to collect this data before
Sep 6 02:48:41 gpu-8-server-intesight kernel: [557999.901942] NVRM: the NVIDIA kernel module is unloaded.
Sep 6 02:48:41 gpu-8-server-intesight kernel: [558000.356948] NVRM: RmInitAdapter failed! (0x30:0xffff:682)
Sep 6 02:48:41 gpu-8-server-intesight kernel: [558000.444379] NVRM: rm_init_adapter failed for device bearing minor number 0
Sep 6 02:48:45 gpu-8-server-intesight kernel: [558004.604173] NVRM: request_irq() failed (-22)
Sep 6 02:48:48 gpu-8-server-intesight kernel: [558007.497475] NVRM: RmInitAdapter failed! (0x23:0x56:468)
Sep 6 02:48:48 gpu-8-server-intesight kernel: [558007.497489] NVRM: rm_init_adapter failed for device bearing minor number 0
Sep 6 02:48:50 gpu-8-server-intesight kernel: [558008.878985] NVRM: request_irq() failed (-22)
Sep 6 02:48:53 gpu-8-server-intesight kernel: [558011.735642] NVRM: RmInitAdapter failed! (0x23:0x56:468)
Sep 6 02:48:53 gpu-8-server-intesight kernel: [558011.735658] NVRM: rm_init_adapter failed for device bearing minor number 0
Sep 6 02:48:54 gpu-8-server-intesight kernel: [558013.108772] NVRM: request_irq() failed (-22)
Sep 6 02:48:55 gpu-8-server-intesight kernel: [558013.757168] BUG: unable to handle kernel paging request at 0000000132081000
Sep 6 02:48:55 gpu-8-server-intesight kernel: [558013.757173] IP: [] kmem_cache_alloc+0x77/0x1f0
Sep 6 02:48:55 gpu-8-server-intesight kernel: [558013.757175] PGD 10357d8067 PUD 0
Best answer
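We have run into this problem. As far as I can tell, your setup is very similar to ours: multiple GPUs on an X99 motherboard. We managed to mitigate the errors by setting pcie_aspm=off in the kernel boot parameters. If you search for "aspm" in the nvidia bug report log you provided, you will notice the following: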
[ 0.167842] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[ 0.278085] acpi PNP0A03:03: FADT indicates ASPM is unsupported, using BIOS configuration
[ 0.282583] acpi PNP0A08:00: FADT indicates ASPM is unsupported, using BIOS configuration
[ 2.795337] r8169 0000:0a:00.0: can't disable ASPM; OS doesn't have ASPM control
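If you want to repeat that search on your own machine, here is a minimal sketch in Python. It assumes you have a plain-text (already decompressed) log to read; the function names find_aspm_lines and scan_file are made up for this example and are not part of any Nvidia tool.

```python
import re
from typing import Iterable, List

# Case-insensitive, so it matches both "ASPM" and "aspm".
ASPM_RE = re.compile(r"aspm", re.IGNORECASE)


def find_aspm_lines(lines: Iterable[str]) -> List[str]:
    """Return every line that mentions ASPM, with the trailing newline removed."""
    return [line.rstrip("\n") for line in lines if ASPM_RE.search(line)]


def scan_file(path: str) -> List[str]:
    """Scan a text log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as fh:
        return find_aspm_lines(fh)
```

Calling scan_file("nvidia-bug-report.log") (the path is an assumption, use wherever your log lives) returns the matching lines so you can compare them with the four shown above.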
Our GPU server still has some issues at the moment, but this may help.
I originally found this idea in this thread.
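Update: we still occasionally see RmInitAdapter messages, but we no longer have any stability problems. For the record, we are now running Nvidia's 387.34 driver with the following boot parameters: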
pcie_aspm=off rcutree.rcu_idle_gp_delay=1
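On Ubuntu these parameters usually go into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub and a reboot. After rebooting it is worth confirming that the running kernel actually picked them up. The sketch below is one way to do that in Python by reading /proc/cmdline. The helper names (parse_cmdline, missing_params, check_running_kernel) are hypothetical, and the simple whitespace split assumes no quoted values containing spaces.

```python
from typing import Dict, Optional

# The two parameters from the answer above.
WANTED = {"pcie_aspm": "off", "rcutree.rcu_idle_gp_delay": "1"}


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline: str, wanted: Dict[str, str]) -> Dict[str, str]:
    """Return the wanted parameters that are absent or set to a different value."""
    current = parse_cmdline(cmdline)
    return {name: value for name, value in wanted.items()
            if current.get(name) != value}


def check_running_kernel(path: str = "/proc/cmdline") -> Dict[str, str]:
    """Check the currently running kernel (Linux only)."""
    with open(path) as fh:
        return missing_params(fh.read(), WANTED)
```

An empty dict from check_running_kernel() means both parameters are active; anything else lists what still needs to be set.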
As a side note, we also have a newer quad-GPU box based on an X299 motherboard, and we see similar problems on it.
Source: the original question on Stack Overflow, "ubuntu - NVRM: RmInitAdapter failed: Xid: 79, GPU has fallen off the bus": https://stackoverflow.com/questions/46107222/