
kubernetes - Autoscaling GPU nodes in Kubernetes

Reposted. Author: 行者123. Updated: 2023-12-02 11:43:09

I created a small cluster on GKE with GPU nodes as follows:

# create cluster and CPU nodes
gcloud container clusters create clic-cluster \
    --zone us-west1-b \
    --machine-type n1-standard-1 \
    --enable-autoscaling \
    --min-nodes 1 \
    --max-nodes 3 \
    --num-nodes 2

# add GPU nodes
gcloud container node-pools create gpu-pool \
    --zone us-west1-b \
    --machine-type n1-standard-2 \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --cluster clic-cluster \
    --enable-autoscaling \
    --min-nodes 1 \
    --max-nodes 2 \
    --num-nodes 1

When I submit a GPU job, it runs successfully on the GPU node. However, when I submit a second job, I get an UnexpectedAdmissionError from kubernetes:

Update plugin resources failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected.
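
When this error shows up, it can help to compare what each node advertises with what the pod's events say. The following is a minimal sketch, assuming `kubectl` is already pointed at the cluster; the two function names are hypothetical wrappers, and only the `kubectl` calls inside them are standard.

```shell
# Sketch only: both functions are hypothetical wrappers around standard kubectl calls.

# Show how many nvidia.com/gpu devices each node advertises as allocatable.
show_gpu_capacity() {
  kubectl get nodes \
    "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
}

# Show the details and events recorded for one pod; admission failures
# and cluster autoscaler scale-up decisions are reported there.
show_pod_events() {
  kubectl describe pod "$1"
}
```

Calling `show_gpu_capacity` lists the nodes with their GPU counts, and `show_pod_events <pod_name>` prints the events for the pod created by the failing job.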



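I would have expected the cluster to start a second GPU node and place the job there. Any idea why this didn't happen? My job spec looks roughly like this: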
apiVersion: batch/v1
kind: Job
metadata:
  name: <job_name>
spec:
  template:
    spec:
      initContainers:
      - name: decode
        image: "<decoder_image>"
        resources:
          limits:
            nvidia.com/gpu: 1
        command: [...]
      [...]
      containers:
      - name: evaluate
        image: "<evaluation_image>"
        command: [...]

Best answer

The resource constraint also needs to be added to the containers spec:

apiVersion: batch/v1
kind: Job
metadata:
  name: <job_name>
spec:
  template:
    spec:
      initContainers:
      - name: decode
        image: "<decoder_image>"
        resources:
          limits:
            nvidia.com/gpu: 1
        command: [...]
      [...]
      containers:
      - name: evaluate
        image: "<evaluation_image>"
        resources:
          limits:
            nvidia.com/gpu: 1
        command: [...]

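I only need the GPU in one of the initContainers, but declaring it only there seems to confuse the scheduler. With the limit also set on the main container, autoscaling and scheduling now work as expected.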
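
One point worth checking about this fix is whether it makes each pod ask for two GPUs. According to the Kubernetes documentation on init containers, a pod's effective request for a resource is the larger of (a) the highest value on any init container and (b) the sum over the app containers, so the answer should be no. The sketch below only illustrates that arithmetic; `effective_gpus` is a hypothetical helper, not a kubectl or Kubernetes command, and it does not explain the admission error itself.

```shell
# effective_gpus is a hypothetical helper for illustration only.
# $1: space-separated GPU limits of the init containers
# $2: space-separated GPU limits of the app containers
effective_gpus() {
  init_max=0
  app_sum=0
  # Init containers run one at a time, so only the largest limit counts.
  for n in $1; do
    if [ "$n" -gt "$init_max" ]; then init_max=$n; fi
  done
  # App containers run together, so their limits add up.
  for n in $2; do
    app_sum=$((app_sum + n))
  done
  if [ "$init_max" -gt "$app_sum" ]; then
    echo "$init_max"
  else
    echo "$app_sum"
  fi
}

effective_gpus "1" "0"   # GPU only on the decode init container
effective_gpus "1" "1"   # GPU on decode and on evaluate
```

Both calls print 1, so under that rule the pod still needs a single GPU per node after the change.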

Regarding kubernetes - Autoscaling GPU nodes in Kubernetes, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58720596/
