
python - Slurm job memory limit exceeded with Python multiprocessing


I am using Slurm to manage some of our computations, but sometimes jobs are killed with an out-of-memory error even though this should not be the case. This strange problem shows up in particular with Python jobs that use multiprocessing.

Here is a minimal example that reproduces the behavior:

#!/usr/bin/python

from time import sleep

nmem = int(3e7) # this will amount to ~1GB of numbers
nprocs = 200 # will create this many workers later
nsleep = 5 # sleep seconds

array = list(range(nmem)) # allocate some memory

print("done allocating memory")
sleep(nsleep)
print("continuing with multiple processes (" + str(nprocs) + ")")

from multiprocessing import Pool

def f(i):
    sleep(nsleep)

# this will create a pool of workers, each of which "seem" to use 1GB
# even though the individual processes don't actually allocate any memory
p = Pool(nprocs)
p.map(f,list(range(nprocs)))

print("finished successfully")

Although this may run fine locally, Slurm's memory accounting seems to sum the resident memory of every process, reporting a memory usage of nprocs x 1GB instead of just 1 GB (the actual usage). I don't think this is what it should do, and it is not what the OS is doing either; the machine does not appear to be swapping or anything like that.
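One way to see the copy-on-write sharing from inside the job is to compare RSS with PSS: RSS counts every shared page in full for each worker, while PSS divides shared pages among the processes that map them. The snippet below is my own minimal sketch (not part of the original question); it assumes Linux with the fork start method and /proc/self/smaps_rollup (kernel 4.14+).

import os
from multiprocessing import Pool

array = list(range(int(3e7)))  # ~1 GB allocated in the parent before forking

def report(_):
    # Parse RSS and PSS (in kB) for this worker from /proc/self/smaps_rollup
    fields = {}
    with open("/proc/self/smaps_rollup") as fh:
        for line in fh:
            key, _sep, value = line.partition(":")
            parts = value.split()
            if parts and parts[-1] == "kB":
                fields[key] = int(parts[0])
    return os.getpid(), fields.get("Rss", 0), fields.get("Pss", 0)

if __name__ == "__main__":
    with Pool(4) as p:
        # Workers do not touch `array`, so its pages stay shared with the parent:
        # each worker's Rss still includes the full ~1 GB, but Pss stays small.
        for pid, rss, pss in p.map(report, range(4)):
            print("pid %d: Rss %d kB, Pss %d kB" % (pid, rss, pss))

An accounting plugin that sums RSS across such workers will therefore overstate the job's real footprint by roughly nprocs x 1GB.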

This is the output if I run the code locally:

> python test-slurm-mem.py 
done allocation memory
continuing with multiple processes (0)
finished successfully

And a screenshot of htop:

[htop screenshot showing a total memory use of 8GB with 200 processes of 1GB each]

And this is the output when I run the same command through Slurm:

> srun --nodelist=compute3 --mem=128G python test-slurm-mem.py 
srun: job 694697 queued and waiting for resources
srun: job 694697 has been allocated resources
done allocating memory
continuing with multiple processes (200)
slurmstepd: Step 694697.0 exceeded memory limit (193419088 > 131968000), being killed
srun: Exceeded job memory limit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: *** STEP 694697.0 ON compute3 CANCELLED AT 2018-09-20T10:22:53 ***
srun: error: compute3: task 0: Killed
> $ sacct --format State,ExitCode,JobName,ReqCPUs,MaxRSS,AveCPU,Elapsed -j 694697.0
State ExitCode JobName ReqCPUS MaxRSS AveCPU Elapsed
---------- -------- ---------- -------- ---------- ---------- ----------
CANCELLED+ 0:9 python 2 193419088K 00:00:04 00:00:13

Best answer

For others: as vaguely pointed out in the comments, you need to change the file slurm.conf. In this file, set the option JobAcctGatherType to jobacct_gather/cgroup (full line: JobAcctGatherType=jobacct_gather/cgroup).

I previously had this option set to jobacct_gather/linux, which caused the erroneous accounting values described in the question.
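For reference, a minimal sketch of the relevant slurm.conf lines, assuming a cgroup-enabled installation. Only the JobAcctGatherType line comes from the answer; the ProctrackType and TaskPlugin lines are my assumptions about a typical cgroup setup, and the exact plugin set depends on your site. The daemons (slurmctld/slurmd) generally need a restart or reconfigure for the change to take effect.

# slurm.conf (excerpt): gather accounting per cgroup instead of summing per-process RSS
JobAcctGatherType=jobacct_gather/cgroup
# Assumed companion settings for a cgroup-based setup (verify against your site config)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup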

This question about Slurm exceeding the job memory limit with Python multiprocessing is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/52421171/
