
ubuntu - SLURM does not respect requested resources

Reposted. Author: 行者123. Updated: 2023-12-04 18:29:52

I have the following submission script named "test.sub":

#!/bin/bash
#SBATCH --workdir=./
#SBATCH -o test.out
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --requeue
#SBATCH --job-name=test

x=0

while [ "$x" -le 100 ]; do
    echo "Test $x" >> test.out
    sleep 100
    x=$((x+1))
done

When I submit this job script, the job does start. However, when I check the job status with scontrol show job, I get the following output:
...
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
...
NumNodes=1 NumCPUs=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,node=1

Does this mean the job is using 64 CPUs rather than the 1 specified in the job script? If so, what should I do to fix it? I have the following Slurm configuration file (/etc/slurm-llnl/slurm.conf):
ControlMachine=DDHP-P1-server
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6816
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6817
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#StateSaveLocation=/var/lib/slurm-llnl/slurmctld
StateSaveLocation=/apps2/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

#ClusterName=(null) NodeName=DDHP-P1-server slurmd: Considering each NUMA node as a socket
#CPUs=64 Boards=1 SocketsPerBoard=8 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=257940 TmpDisk=171660

#NodeName=DDHP-P1-server CPUs=64 RealMemory=264131 State=UNKNOWN
NodeName=DDHP-P1-server CPUs=64 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=252000 State=UNKNOWN
PartitionName=debug Nodes=DDHP-P1-server Default=YES MaxTime=INFINITE State=UP

Thanks for your help! :)

Best answer

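The problem is the line

SelectType=select/linear

in your configuration file. It tells Slurm to allocate whole nodes to jobs, which is why the job is shown holding all 64 CPUs of the node. If you want Slurm to allocate individual cores instead, you need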
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
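
As a rough sketch, the relevant part of the slurm.conf from the question might look like this after the change. Only the two SelectType lines come from the answer. The DefMemPerCPU line and its value are assumptions added here for illustration: with CR_Core_Memory, memory becomes a consumable resource, and a job that requests no memory may otherwise be given all of the node's memory.

```
# /etc/slurm-llnl/slurm.conf (excerpt)

# Old setting: whole nodes are allocated to each job
#SelectType=select/linear

# New setting: cores and memory are tracked as consumable resources
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Assumption, not part of the original answer: default memory per CPU in MB.
# The value is illustrative (roughly RealMemory / CPUs for this node).
DefMemPerCPU=3900
```

A change of SelectType generally requires restarting the Slurm daemons rather than only running scontrol reconfigure; how to restart them depends on the installation. On newer Slurm releases the corresponding plugin is select/cons_tres, so it is worth checking which version is installed.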

See the Slurm documentation for alternative SelectTypeParameters options.
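
To confirm the effect once the new configuration is active, the NumCPUs field of scontrol show job can be checked again. The helper below is a minimal sketch; the function name num_cpus_from_scontrol is hypothetical, and it only parses text, so the example runs on the line from the question without Slurm installed.

```shell
#!/bin/bash

# Hypothetical helper: print the NumCPUs value found in
# `scontrol show job` output that is read from stdin.
num_cpus_from_scontrol() {
    # Put every whitespace-separated field on its own line,
    # then keep only the number from the NumCPUs=<n> field.
    tr -s ' \t' '\n' | sed -n 's/^NumCPUs=\([0-9][0-9]*\)$/\1/p' | head -n 1
}

# Example using the line shown in the question (prints 64):
sample='NumNodes=1 NumCPUs=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*'
printf '%s\n' "$sample" | num_cpus_from_scontrol
```

On a live system the same function could be fed real output, for example scontrol show job JOBID | num_cpus_from_scontrol, where JOBID stands for the job's ID. With one task requested, the value would be expected to drop from 64 to a single core's worth of CPUs (possibly 2 rather than 1, since the node definition declares ThreadsPerCore=2; this is an expectation, not something the source confirms).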

Regarding "ubuntu - SLURM does not respect requested resources", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49528178/
