TensorFlow works in a Slurm interactive session but not in a Slurm job

Reposted by: 行者123. Updated: 2023-12-04 08:21:24

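I am trying to get some TensorFlow/Jax code running on the GPUs of a Slurm cluster. When I request an interactive GPU session and run my code, everything works. But when I submit the same code as a Slurm job, I get the classic TensorFlow error: failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected.
I suspect this is a Slurm problem, but I am not sure. Does anyone know what the root cause might be, or how to fix it?
The interactive session succeeds: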

rschaeffer@boslogin04 ~/PehlevanLab-Dopamine$ srun --pty -p gpu_test -t 0-06:00 --mem 8000 --gres=gpu:1 /bin/bash
srun: job 11386921 queued and waiting for resources
srun: job 11386921 has been allocated resources
rschaeffer@holygpu2c0701 ~/PehlevanLab-Dopamine$ source dopamine_venv/bin/activate
(dopamine_venv) rschaeffer@holygpu2c0701 ~/PehlevanLab-Dopamine$ module load cuda/10.1.243-fasrc01
(dopamine_venv) rschaeffer@holygpu2c0701 ~/PehlevanLab-Dopamine$ echo "Loaded CUDA"
Loaded CUDA
(dopamine_venv) rschaeffer@holygpu2c0701 ~/PehlevanLab-Dopamine$ module load cudnn/7.6.5.32_cuda10.1-fasrc01
(dopamine_venv) rschaeffer@holygpu2c0701 ~/PehlevanLab-Dopamine$ echo "Loaded cuDNN"
Loaded cuDNN
(dopamine_venv) rschaeffer@holygpu2c0701 ~/PehlevanLab-Dopamine$ python3 -u exploration_train_c51_agent.py --agent=c51 --game=Pong --seed=1
2020-12-28 13:51:42.825084: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-28 13:52:09.842508: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-12-28 13:52:09.892837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:2f:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-12-28 13:52:09.893056: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-28 13:52:09.911570: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-28 13:52:09.923173: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-12-28 13:52:09.931751: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-12-28 13:52:09.944798: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-12-28 13:52:09.954852: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-12-28 13:52:09.973711: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-12-28 13:52:09.978247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
WARNING:tensorflow:From /n/home02/rschaeffer/PehlevanLab-Dopamine/dopamine_venv/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /n/home02/rschaeffer/PehlevanLab-Dopamine/dopamine_venv/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2020-12-28 13:52:10.292352: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-28 13:52:10.300610: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2600000000 Hz
2020-12-28 13:52:10.300787: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x559d70c19f30 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-28 13:52:10.300839: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-12-28 13:52:10.441301: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x559d70c29eb0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-12-28 13:52:10.441420: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-PCIE-32GB, Compute Capability 7.0
2020-12-28 13:52:10.442716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:2f:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-12-28 13:52:10.442884: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-28 13:52:10.442966: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-28 13:52:10.443026: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-12-28 13:52:10.443093: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-12-28 13:52:10.443151: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-12-28 13:52:10.443209: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-12-28 13:52:10.443277: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-12-28 13:52:10.445472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-12-28 13:52:10.445562: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-28 13:52:10.944749: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-28 13:52:10.944891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
2020-12-28 13:52:10.944944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
2020-12-28 13:52:10.947443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30132 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:2f:00.0, compute capability: 7.0)
Will train agent, please be patient, may be a while...
Steps executed: 19522 Episode length: 1266 Return: -19.0

Slurm script:

#!/bin/bash
#SBATCH -p gpu         # partition (one of shared, gpu, test, gpu_test, pehlevan_gpu)
#SBATCH -n 1                    # one
#SBATCH --mem=32G
#SBATCH --time=99:99:99         # total run time limit (HH:MM:SS)
#SBATCH --mail-user=rschaeffer
#SBATCH --mail-type=FAIL

agent=${1}
game=${2}
seed=${3}
echo $agent
echo $game
echo $seed

source dopamine_venv/bin/activate
module load cuda/10.1.243-fasrc01
echo "Loaded CUDA"
module load cudnn/7.6.5.32_cuda10.1-fasrc01
echo "Loaded cuDNN"


python3 -u exploration_train_c51_agent.py --agent="${agent}" --game="${game}" --seed="${seed}"

Error when submitting the Slurm job:

2020-12-28 13:39:57.037537: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-28 13:40:26.129789: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-12-28 13:40:26.151975: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-12-28 13:40:26.152026: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: holygpu2c0714.rc.fas.harvard.edu
2020-12-28 13:40:26.152036: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: holygpu2c0714.rc.fas.harvard.edu
2020-12-28 13:40:26.152177: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 455.23.5
2020-12-28 13:40:26.152231: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 455.23.5
2020-12-28 13:40:26.152238: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 455.23.5

Best answer

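The obvious difference between the interactive session and the batch job is that in the former you request a GPU with --gres=gpu:1, while in the latter you do not.
That alone is enough to prevent your job from accessing any GPU, even though it runs on a node in the gpu partition.
So add

#SBATCH --gres=gpu:1

to your submission script, for example right after #SBATCH -p gpu.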
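
As a rough sketch of what the corrected header could look like, the fragment below adds the GPU request and then checks what the job was actually granted before training starts. It assumes the cluster exports CUDA_VISIBLE_DEVICES for jobs that were allocated a GPU, which is common but depends on site configuration. The helper name check_gpu_allocation is hypothetical and not part of Slurm.

```shell
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:1            # request one GPU, as in the interactive srun call
#SBATCH -n 1
#SBATCH --mem=32G

# Hypothetical helper (not part of Slurm): report whether a GPU was granted.
# Assumption: Slurm exports CUDA_VISIBLE_DEVICES when a GPU is allocated.
check_gpu_allocation() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "no GPU allocated"
        return 1
    fi
    echo "GPU(s) allocated: ${CUDA_VISIBLE_DEVICES}"
}

# Warn early instead of waiting for TensorFlow to fail in cuInit.
check_gpu_allocation || echo "check the --gres request in this script" >&2
```

All #SBATCH directives have to appear before the first executable command in the script, so the new line belongs in the header block. The same option can also be passed on the command line (sbatch --gres=gpu:1 followed by the script name and its arguments), where it overrides the directives in the script.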

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/65482637/
