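I am trying to run a multi-node job with aprun, but I cannot figure out how to get the rank (or anything else that serves as a per-process ID) inside a bash environment. Take this simple job: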
aprun -n 8 -N 2 ./examplebashscript.sh
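How can I get the rank of each spawned process? Without a rank or some other unique ID, this aprun line just runs the exact same program 8 times, which is not what I want.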
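I have been reading the documentation and, surprisingly, could not find anything that explains the default variables aprun provides.
I have used mpirun before and know how to get each process's rank in C and Python programs, but not in Bash. aprun is even less well documented.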
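Best answer
One workable approach is to write a wrapper script that takes the list of tasks to run and spawns each task as a separate script.
From your snippet, it looks like you want each compute node to run 2 instances of the script, for a total of 8, so in your job script you could do something like this: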
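# Start one wrapper per pair of indices (0, 2, 4, 6), each in the background.
for (( i=0; i<8; i+=2 )); do
    aprun -n 1 ./wrapper.sh "$i" 2 &
done
wait  # block until every backgrounded aprun has finished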
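To check which starting indices the loop hands out before submitting anything, the launch commands can be printed instead of executed. This is only a sketch: `print_launch_plan`, `total` and `per_node` are hypothetical names introduced here, and the handling of a total that is not a multiple of the per-node count is an addition that the original loop does not need for 8 and 2.

```bash
# Dry run: print the aprun commands the job-script loop would start.
# print_launch_plan, total and per_node are illustrative names, not aprun options.
print_launch_plan() {
    total=$1
    per_node=$2
    i=0
    while [ "$i" -lt "$total" ]; do
        # The last wrapper gets fewer instances if total is not a multiple of per_node.
        count=$per_node
        if [ $(( total - i )) -lt "$per_node" ]; then
            count=$(( total - i ))
        fi
        # Each wrapper receives its first index and how many instances to start.
        echo "aprun -n 1 ./wrapper.sh $i $count &"
        i=$(( i + per_node ))
    done
}

print_launch_plan 8 2
```

With 8 and 2 this prints four commands, one per pair of indices, matching the loop above.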
Then in the wrapper you could do something like this (where $j gives you your unique index):
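# $1 = first index, $2 = number of instances to start
end=$(( $1 + $2 ))
for (( j=$1; j<end; j+=1 )); do
    ./examplebashscript.sh "$j" &
done
wait  # block until every instance started above has finished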
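The two levels can be tried out on a single machine to confirm that every instance ends up with a distinct index. The sketch below replaces the aprun launch with a plain local function call and uses a hypothetical `example_task` function as a stand-in for ./examplebashscript.sh, so it says nothing about how processes are placed on compute nodes.

```bash
# Stand-in for ./examplebashscript.sh (hypothetical): report the index it was given.
example_task() {
    echo "task index $1"
}

# Same logic as the wrapper: start $2 instances, beginning at index $1.
wrapper() {
    j=$1
    end=$(( $1 + $2 ))
    while [ "$j" -lt "$end" ]; do
        example_task "$j" &
        j=$(( j + 1 ))
    done
    wait
}

# Same loop as the job script, with the aprun launch replaced by a local call.
run_all() {
    i=0
    while [ "$i" -lt 8 ]; do
        wrapper "$i" 2 &
        i=$(( i + 2 ))
    done
    wait
}

# The instances finish in no particular order, so sort by the index field.
run_all | sort -k3,3n
```

Each of the four wrappers covers its own pair of indices, so the sorted output lists the indices 0 through 7 exactly once.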
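You can also set the following environment variables to see where the various processes and threads are placed. Set them in your shell (or job script) before running aprun: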
export MPICH_CPUMASK_DISPLAY=1
export MPICH_RANK_REORDER_DISPLAY=1
For example, running:
aprun -n 24 ./examplebashscript.sh
(which is shorthand for):
aprun -n 24 -N 24 -S 12 -d 1 ./examplebashscript.sh
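will give you output like the following on STDERR (note that this is on an XC30 with two 12-core Intel Ivy Bridge processors per compute node, so with hyperthreading enabled the mask shows 48 cores per node):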
[PE_0]: MPI rank order: Using default aprun rank ordering.
[PE_0]: rank 0 is on nid02749
[PE_0]: rank 1 is on nid02749
[PE_0]: rank 2 is on nid02749
[PE_0]: rank 3 is on nid02749
[PE_0]: rank 4 is on nid02749
[PE_0]: rank 5 is on nid02749
[PE_0]: rank 6 is on nid02749
[PE_0]: rank 7 is on nid02749
[PE_0]: rank 8 is on nid02749
[PE_0]: rank 9 is on nid02749
[PE_0]: rank 10 is on nid02749
[PE_0]: rank 11 is on nid02749
[PE_0]: rank 12 is on nid02749
[PE_0]: rank 13 is on nid02749
[PE_0]: rank 14 is on nid02749
[PE_0]: rank 15 is on nid02749
[PE_0]: rank 16 is on nid02749
[PE_0]: rank 17 is on nid02749
[PE_0]: rank 18 is on nid02749
[PE_0]: rank 19 is on nid02749
[PE_0]: rank 20 is on nid02749
[PE_0]: rank 21 is on nid02749
[PE_0]: rank 22 is on nid02749
[PE_0]: rank 23 is on nid02749
[PE_23]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000100000000000000000000000
[PE_22]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000010000000000000000000000
[PE_21]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000001000000000000000000000
[PE_0]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000001
[PE_20]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000100000000000000000000
[PE_9]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000001000000000
[PE_11]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000100000000000
[PE_10]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000010000000000
[PE_8]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000100000000
[PE_1]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000010
[PE_2]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000100
[PE_18]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000001000000000000000000
[PE_7]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000010000000
[PE_15]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000001000000000000000
[PE_3]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000001000
[PE_6]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000001000000
[PE_16]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000010000000000000000
[PE_14]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000100000000000000
[PE_13]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000010000000000000
[PE_12]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000001000000000000
[PE_4]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000010000
[PE_5]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000100000
[PE_17]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000100000000000000000
[PE_19]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000010000000000000000000
You may be able to capture this output in some way and make use of it.
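One possible way to use the captured output is to redirect STDERR to a file and pull the rank, node and CPU out of each line. The following is a rough sketch, not a tested recipe for any particular system: `rank_nodes`, `rank_cpus` and `placement.log` are hypothetical names, the parsing assumes the exact line format shown above, and the CPU index assumes a mask with a single bit set whose rightmost character is CPU 0 (which is consistent with PE_0 in the output above).

```bash
# Print "rank node" pairs from captured "rank N is on nidXXXXX" lines.
rank_nodes() {
    sed -n 's/^\[PE_[0-9]*\]: rank \([0-9]*\) is on \(.*\)$/\1 \2/p' "$1"
}

# Print "rank cpu" pairs from the cpumask lines. The CPU index is taken as the
# position of the first 1 in the mask, counted from the right-hand end.
rank_cpus() {
    awk '/cpumask = / {
        rank = $1
        gsub(/[^0-9]/, "", rank)    # "[PE_23]:" becomes "23"
        mask = $NF
        cpu = length(mask) - index(mask, "1")
        print rank, cpu
    }' "$1"
}

# On the real system the lines could be captured with ordinary redirection:
#   aprun -n 24 ./examplebashscript.sh 2> placement.log
# Here a few lines copied from the output above stand in for that file.
log=$(mktemp)
cat > "$log" <<'EOF'
[PE_0]: MPI rank order: Using default aprun rank ordering.
[PE_0]: rank 0 is on nid02749
[PE_0]: rank 1 is on nid02749
[PE_23]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000100000000000000000000000
[PE_0]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000001
EOF

rank_nodes "$log"
rank_cpus "$log" | sort -n
rm -f "$log"
```

Note that this only tells you where each rank ran after the fact; it does not by itself hand a rank to the running script.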
Regarding bash - how to get the rank in aprun, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29039056/