
ssh - srun : error: Slurm controller not responding, sleeping and retrying

Reposted. Author: 行者123. Updated: 2023-12-03 08:42:08

Running the following command in Slurm:

$ srun -J FRD_gpu --partition=gpu --gres=gpu:1 --time=0-02:59:00 --mem=2000 --ntasks=1 --cpus-per-task=1 --pty /bin/bash -i

returns the following error:
srun: error: Slurm controller not responding, sleeping and retrying.

The Slurm controller appears to be up:
$ scontrol ping
Slurmctld(primary) at narvi-install is UP

Any idea why this happens and how to fix it?
$ scontrol -V
slurm 18.08.8

System information: gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)
$ sinfo 
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal up 7-00:00:00 1 drain* me99
normal up 7-00:00:00 3 down* me[64-65,97]
normal up 7-00:00:00 1 drain me89
normal up 7-00:00:00 23 mix me[55,67,86,88,90-94,96,98,100-101],na[27,41-42,44-45,47-49,51-52]
normal up 7-00:00:00 84 alloc me[56-63,66,68-74,76-81,83-85,87,95,102,153-158],na[01-26,28-40,43,46,50,53-60]
normal up 7-00:00:00 3 idle me[82,151-152]
test* up 4:00:00 1 drain* me99
test* up 4:00:00 3 down* me[64-65,97]
test* up 4:00:00 2 drain me[04,89]
test* up 4:00:00 27 mix me[55,67,86,88,90-94,96,98,100-101,248,260],meg[11-12],na[27,41-42,44-45,47-49,51-52]
test* up 4:00:00 130 alloc me[56-63,66,68-74,76-81,83-85,87,95,102,153-158,233-247,249-259,261-280],na[01-26,28-40,43,46,50,53-60]
test* up 4:00:00 14 idle me[01-03,50-54,82,151-152],meg10,nag[01,14]
grid up 7-00:00:00 10 mix na[27,41-42,44-45,47-49,51-52]
grid up 7-00:00:00 42 alloc na[01-26,28-32,43,46,50,53-60]
gpu up 7-00:00:00 15 mix meg[11-12],nag[02-10,12-13,16-17]
gpu up 7-00:00:00 4 idle meg10,nag[01,11,15]

Best answer

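If you are sure the Slurm controller is up and running (for example, the sinfo command responds), SSH into the compute node allocated to your job and run scontrol ping there to test connectivity to the controller. If that fails, look for firewall rules that block connections from the compute node to the controller node.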
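The check described above can be sketched as a pair of small shell helpers. This is only an illustration: the function names `controller_is_up` and `controller_host` are hypothetical (they are not part of Slurm), and the parsing assumes that `scontrol ping` prints lines in the same form as the output shown in the question (`Slurmctld(primary) at <host> is UP`). Other Slurm versions may format this differently.

```shell
#!/usr/bin/env bash
# Sketch only: these helper names are hypothetical, not Slurm commands.
# Assumes `scontrol ping` output looks like:
#   Slurmctld(primary) at narvi-install is UP

# Succeeds if the text passed as $1 reports at least one controller as UP.
controller_is_up() {
    printf '%s\n' "$1" | grep -q ' is UP$'
}

# Prints the primary controller's host name found in the text passed as $1.
controller_host() {
    printf '%s\n' "$1" | sed -n 's/^Slurmctld(primary) at \(.*\) is [A-Z]*$/\1/p'
}

# Example use on the compute node (left commented out so the helpers
# can be sourced on a machine without Slurm installed):
#   out=$(scontrol ping)
#   if controller_is_up "$out"; then
#       echo "controller reachable from $(hostname)"
#   else
#       echo "cannot reach $(controller_host "$out") from $(hostname); check firewall rules"
#   fi
```

If the check succeeds on the login node but fails on the compute node, that points at the network path between the two (for instance a firewall rule) rather than at the controller daemon itself.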

Regarding "ssh - srun : error: Slurm controller not responding, sleeping and retrying", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60882335/
