
TensorFlow works in a Slurm interactive session but not in a Slurm job

Reposted. Author: 行者123. Updated: 2023-12-04 08:21:24

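I am trying to get some TensorFlow/Jax code running on the GPUs of a Slurm cluster. When I request an interactive GPU session and run my code, everything works. But when I submit the same code as a Slurm job, I get the classic TensorFlow error: failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected.
I suspect this is a Slurm problem, but I am not sure. Does anyone know what the root cause might be or how to fix it?
The interactive session succeeds: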

rschaeffer@boslogin04 ~/PehlevanLab-Dopamine$ srun --pty -p gpu_test -t 0-06:00 --mem 8000 --gres=gpu:1 /bin/bash
srun: job 11386921 queued and waiting for resources
srun: job 11386921 has been allocated resources
rschaeffer@holygpu2c0701 ~/PehlevanLab-Dopamine$ source dopamine_venv/bin/activate
(dopamine_venv) rschaeffer@holygpu2c0701 ~/PehlevanLab-Dopamine$ module load cuda/10.1.243-fasrc01
(dopamine_venv) rschaeffer@holygpu2c0701 ~/PehlevanLab-Dopamine$ echo "Loaded CUDA"
Loaded CUDA
(dopamine_venv) rschaeffer@holygpu2c0701 ~/PehlevanLab-Dopamine$ module load cudnn/7.6.5.32_cuda10.1-fasrc01
(dopamine_venv) rschaeffer@holygpu2c0701 ~/PehlevanLab-Dopamine$ echo "Loaded cuDNN"
Loaded cuDNN
(dopamine_venv) rschaeffer@holygpu2c0701 ~/PehlevanLab-Dopamine$ python3 -u exploration_train_c51_agent.py --agent=c51 --game=Pong --seed=1
2020-12-28 13:51:42.825084: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-28 13:52:09.842508: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-12-28 13:52:09.892837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:2f:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-12-28 13:52:09.893056: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-28 13:52:09.911570: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-28 13:52:09.923173: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-12-28 13:52:09.931751: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-12-28 13:52:09.944798: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-12-28 13:52:09.954852: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-12-28 13:52:09.973711: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-12-28 13:52:09.978247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
WARNING:tensorflow:From /n/home02/rschaeffer/PehlevanLab-Dopamine/dopamine_venv/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /n/home02/rschaeffer/PehlevanLab-Dopamine/dopamine_venv/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2020-12-28 13:52:10.292352: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-28 13:52:10.300610: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2600000000 Hz
2020-12-28 13:52:10.300787: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x559d70c19f30 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-28 13:52:10.300839: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-12-28 13:52:10.441301: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x559d70c29eb0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-12-28 13:52:10.441420: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-PCIE-32GB, Compute Capability 7.0
2020-12-28 13:52:10.442716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:2f:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-12-28 13:52:10.442884: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-28 13:52:10.442966: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-28 13:52:10.443026: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-12-28 13:52:10.443093: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-12-28 13:52:10.443151: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-12-28 13:52:10.443209: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-12-28 13:52:10.443277: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-12-28 13:52:10.445472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-12-28 13:52:10.445562: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-28 13:52:10.944749: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-28 13:52:10.944891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2020-12-28 13:52:10.944944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2020-12-28 13:52:10.947443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30132 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:2f:00.0, compute capability: 7.0)
Will train agent, please be patient, may be a while...
Steps executed: 19522 Episode length: 1266 Return: -19.0
Slurm script:
#!/bin/bash
#SBATCH -p gpu # partition (one of shared, gpu, test, gpu_test, pehlevan_gpu)
#SBATCH -n 1 # one
#SBATCH --mem=32G
#SBATCH --time=99:99:99 # total run time limit (HH:MM:SS)
#SBATCH --mail-user=rschaeffer
#SBATCH --mail-type=FAIL

agent=${1}
game=${2}
seed=${3}
echo $agent
echo $game
echo $seed

source dopamine_venv/bin/activate
module load cuda/10.1.243-fasrc01
echo "Loaded CUDA"
module load cudnn/7.6.5.32_cuda10.1-fasrc01
echo "Loaded cuDNN"


python3 -u exploration_train_c51_agent.py --agent="${agent}" --game="${game}" --seed="${seed}"
Error when submitting the Slurm job:
2020-12-28 13:39:57.037537: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-28 13:40:26.129789: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-12-28 13:40:26.151975: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-12-28 13:40:26.152026: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: holygpu2c0714.rc.fas.harvard.edu
2020-12-28 13:40:26.152036: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: holygpu2c0714.rc.fas.harvard.edu
2020-12-28 13:40:26.152177: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 455.23.5
2020-12-28 13:40:26.152231: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 455.23.5
2020-12-28 13:40:26.152238: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 455.23.5

Best answer

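The obvious difference between the interactive session and the batch job is that in the former you request a GPU with --gres=gpu:1, while in the latter you do not.
That alone is enough to keep your job from accessing any GPU.
So add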

#SBATCH --gres=gpu:1
to your submission script, for example right after #SBATCH -p gpu.
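
To make the difference visible inside the job itself, the submission script could start with a quick check before launching Python. The following is only a sketch: it assumes the cluster's Slurm is configured to export CUDA_VISIBLE_DEVICES for GPU allocations (common, but site-dependent), and gpu_allocated is a hypothetical helper name, not part of Slurm or TensorFlow.

```shell
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:1   # request one GPU, matching the interactive srun command
#SBATCH -n 1
#SBATCH --mem=32G

# Hypothetical helper: succeeds when the job environment lists at least
# one device in CUDA_VISIBLE_DEVICES (assumed to be set by Slurm).
gpu_allocated() {
    [ -n "${CUDA_VISIBLE_DEVICES:-}" ]
}

if gpu_allocated; then
    echo "GPU(s) visible to this job: ${CUDA_VISIBLE_DEVICES}"
else
    echo "No GPU allocated to this job; check the --gres request" >&2
fi

# If the NVIDIA driver tools are installed on the node, list the devices
# the job can actually see.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L
fi
```

If the check reports no GPU while the same commands work in the interactive session, the resource request is the likely culprit rather than the CUDA or cuDNN modules.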

Regarding "TensorFlow works in a Slurm interactive session but not in a Slurm job", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/65482637/
