azure - Cluster autoscaler cannot scale up from 0 on Azure with ACS-Engine

Reposted. Author: 行者123. Updated: 2023-12-02 11:48:26
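I am trying to set up a cluster in Azure with acs-engine, building a Kubernetes cluster that uses VMSS for the agent pools. Once the cluster is up, I add the cluster autoscaler to manage 2 dedicated agent pools: 1 CPU and 1 GPU. Scale-down and scale-up both work as long as there is still a VM running in the scale set. Both scale sets are configured to scale down to 0. Through ACS, I set up these 2 scale sets with taints and custom labels. Once a scale set has scaled down to 0, I cannot get the autoscaler to bring a node back up when a new pod is scheduled. I'm not sure what I'm doing wrong, or whether I'm missing some configuration, label, taint, etc. I only recently started using Kubernetes.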

Below are my acs-engine JSON, the pod definition, the autoscaler logs, and the pod description.

Output of kubectl logs -n kube-system cluster-autoscaler-5967b96496-jnvjr

I0920 16:11:14.925761       1 scale_up.go:249] Pod default/my-test-pod is unschedulable
I0920 16:11:14.999323 1 utils.go:196] Pod my-test-pod can't be scheduled on k8s-pool2-24760778-vmss, predicate failed: GeneralPredicates predicate mismatch, cannot put default/my-test-pod on template-node-for-k8s-pool2-24760778-vmss-6220731686255962863, reason: node(s) didn't match node selector
I0920 16:11:14.999408 1 utils.go:196] Pod my-test-pod can't be scheduled on k8s-pool3-24760778-vmss, predicate failed: GeneralPredicates predicate mismatch, cannot put default/my-test-pod on template-node-for-k8s-pool3-24760778-vmss-3043543739698957784, reason: node(s) didn't match node selector
I0920 16:11:14.999442 1 scale_up.go:376] No expansion options

Output of kubectl describe pod my-test-pod

Name:               my-test-pod
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             <none>
Annotations:        kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"my-test-pod","namespace":"default"},"spec":{"affinity":{"nodeAffinity":{"preferred...
Status:             Pending
IP:
Containers:
  my-test-pod:
    Image:        ubuntu:latest
    Port:         <none>
    Host Port:    <none>
    Command:
      /bin/bash
      -ec
      while :; do echo '.'; sleep 5; done
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qzm6s (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-qzm6s:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qzm6s
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  agentpool=pool2
                 environment=DEV
                 hardware=cpu-spec
                 node-template=k8s-pool2-24760778-vmss
                 vmSize=Standard_D4s_v3
Tolerations:     dedicated=pool2:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                 From                Message
  ----     ------             ----                ----                -------
  Warning  FailedScheduling   2m (x273 over 17m)  default-scheduler   0/3 nodes are available: 3 node(s) didn't match node selector.
  Normal   NotTriggerScaleUp  2m (x89 over 17m)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added)

acs-engine configuration file (rendered and generated with Terraform)

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.11",
      "kubernetesConfig": {
        "networkPlugin": "azure",
        "clusterSubnet": "${cidr}",
        "privateCluster": {
          "enabled": true
        },
        "addons": [
          {
            "name": "nvidia-device-plugin",
            "enabled": true
          },
          {
            "name": "cluster-autoscaler",
            "enabled": true,
            "config": {
              "minNodes": "0",
              "maxNodes": "2",
              "image": "gcr.io/google-containers/cluster-autoscaler:1.3.1"
            }
          }
        ]
      }
    },
    "masterProfile": {
      "count": ${master_vm_count},
      "dnsPrefix": "${dns_prefix}",
      "vmSize": "${master_vm_size}",
      "storageProfile": "ManagedDisks",
      "vnetSubnetId": "${pool_subnet_id}",
      "firstConsecutiveStaticIP": "${first_master_ip}",
      "vnetCidr": "${cidr}"
    },
    "agentPoolProfiles": [
      {
        "name": "pool3",
        "count": ${dedicated_vm_count},
        "vmSize": "${dedicated_vm_size}",
        "storageProfile": "ManagedDisks",
        "OSDiskSizeGB": 31,
        "vnetSubnetId": "${pool_subnet_id}",
        "customNodeLabels": {
          "vmSize": "${dedicated_vm_size}",
          "dedicatedOnly": "true",
          "environment": "${environment}",
          "hardware": "${dedicated_spec}"
        },
        "availabilityProfile": "VirtualMachineScaleSets",
        "scaleSetEvictionPolicy": "Delete",
        "kubernetesConfig": {
          "kubeletConfig": {
            "--register-with-taints": "dedicated=pool3:NoSchedule"
          }
        }
      },
      {
        "name": "pool2",
        "count": ${pool2_vm_count},
        "vmSize": "${pool2_vm_size}",
        "storageProfile": "ManagedDisks",
        "OSDiskSizeGB": 31,
        "vnetSubnetId": "${pool_subnet_id}",
        "availabilityProfile": "VirtualMachineScaleSets",
        "customNodeLabels": {
          "vmSize": "${pool2_vm_size}",
          "environment": "${environment}",
          "hardware": "${pool_spec}"
        },
        "kubernetesConfig": {
          "kubeletConfig": {
            "--register-with-taints": "dedicated=pool2:NoSchedule"
          }
        }
      },
      {
        "name": "pool1",
        "count": ${pool1_vm_count},
        "vmSize": "${pool1_vm_size}",
        "storageProfile": "ManagedDisks",
        "OSDiskSizeGB": 31,
        "vnetSubnetId": "${pool_subnet_id}",
        "availabilityProfile": "VirtualMachineScaleSets",
        "customNodeLabels": {
          "vmSize": "${pool1_vm_size}",
          "environment": "${environment}",
          "hardware": "${pool_spec}"
        }
      }
    ],
    "linuxProfile": {
      "adminUsername": "${admin_user}",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "${ssh_key}"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "${service_principal_client_id}",
      "secret": "${service_principal_client_secret}"
    }
  }
}

Pod configuration file

apiVersion: v1
kind: Pod
metadata:
  name: my-test-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: vmSize
            operator: In
            values:
            - Standard_D4s_v3
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: hardware
            operator: In
            values:
            - cpu-spec
  nodeSelector:
    agentpool: pool2
    hardware: cpu-spec
    vmSize: Standard_D4s_v3
    environment: DEV
    node-template: k8s-pool2-24760778-vmss
  tolerations:
  - key: dedicated
    operator: Equal
    value: pool2
    effect: NoSchedule
  containers:
  - name: my-test-pod
    image: ubuntu:latest
    command: ["/bin/bash", "-ec", "while :; do echo '.'; sleep 5; done"]
  restartPolicy: Never

I have tried adding and removing entries under nodeAffinity, nodeSelector, and tolerations, but the result is always the same.

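I added pool2 to the autoscaler after the cluster was up. While searching online for a solution, I kept running into posts about needing a node-template label, which I believe takes the form k8s.io/autoscaler/cluster-autoscaler/node-template/label/value, but that seems to be required for AWS.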
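
To make the "didn't match node selector" message in the autoscaler log more concrete, here is a minimal Python sketch of the comparison involved: every key in the pod's nodeSelector must appear with the same value among the labels of the candidate (template) node. The nodeSelector below is copied from the pod spec; the template label set is purely hypothetical, since the question does not show which labels the Azure provider actually places on a template node for an empty scale set. The helper name unmet_selector_keys is also invented for illustration.

```python
from typing import Dict, List


def unmet_selector_keys(node_selector: Dict[str, str],
                        node_labels: Dict[str, str]) -> List[str]:
    """Return the nodeSelector keys whose value is absent or different on the node."""
    return sorted(
        key for key, wanted in node_selector.items()
        if node_labels.get(key) != wanted
    )


# nodeSelector copied from the pod spec in the question.
pod_node_selector = {
    "agentpool": "pool2",
    "hardware": "cpu-spec",
    "vmSize": "Standard_D4s_v3",
    "environment": "DEV",
    "node-template": "k8s-pool2-24760778-vmss",
}

# HYPOTHETICAL labels on a template node built for a scale set with 0 VMs.
# Assumption: none of the custom labels are present on it.
hypothetical_template_labels = {
    "kubernetes.io/hostname": "template-node-for-k8s-pool2-24760778-vmss",
    "beta.kubernetes.io/instance-type": "Standard_D4s_v3",
}

missing = unmet_selector_keys(pod_node_selector, hypothetical_template_labels)
print("selector keys not satisfied by the template:", missing)
```

Under that assumption every selector key is unmet, which would be consistent with the "No expansion options" line in the log. Comparing the same selector against the labels from kubectl get nodes --show-labels on a running pool2 node would show which labels only exist once a real node has registered.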

Can anyone give me some guidance on this for Azure?

Thanks.

Best answer

Update.

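I have found the answer to this problem. After I removed the requiredDuringSchedulingIgnoredDuringExecution node affinity rule and used only preferredDuringSchedulingIgnoredDuringExecution, scheduling the pod correctly brought up a new VM in the scale set.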

apiVersion: v1
kind: Pod
metadata:
  name: my-test-pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: hardware
            operator: In
            values:
            - cpu-spec
  nodeSelector:
    agentpool: pool2
    hardware: cpu-spec
    vmSize: Standard_D4s_v3
    environment: DEV
    node-template: k8s-pool2-24760778-vmss
  tolerations:
  - key: dedicated
    operator: Equal
    value: pool2
    effect: NoSchedule
  containers:
  - name: my-test-pod
    image: ubuntu:latest
    command: ["/bin/bash", "-ec", "while :; do echo '.'; sleep 5; done"]
  restartPolicy: Never
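
The difference between the two affinity kinds used in this answer can be sketched in a few lines of Python. This is a simplified model of the Kubernetes semantics, not the scheduler's or the autoscaler's actual code: a required rule is a hard filter (terms are ORed, expressions inside a term are ANDed), while a preferred rule only contributes a weight to the ranking of nodes that already pass. Only the In operator is modelled, and the function names and the empty label set for a node without custom labels are assumptions made for illustration.

```python
from typing import Dict, List

Expression = Dict[str, object]  # {"key": str, "operator": "In", "values": [str, ...]}


def matches(expressions: List[Expression], labels: Dict[str, str]) -> bool:
    """True when every expression holds for the labels (only 'In' is modelled)."""
    for expr in expressions:
        if expr["operator"] != "In":
            raise ValueError("only the In operator is modelled in this sketch")
        if labels.get(expr["key"]) not in expr["values"]:
            return False
    return True


def feasible(required_terms: List[List[Expression]], labels: Dict[str, str]) -> bool:
    """Required affinity: the node must satisfy at least one term, if any exist."""
    if not required_terms:
        return True
    return any(matches(term, labels) for term in required_terms)


def preference_score(preferred: List[dict], labels: Dict[str, str]) -> int:
    """Preferred affinity: sum the weights of matching preferences; never excludes."""
    return sum(p["weight"] for p in preferred if matches(p["preference"], labels))


# Rules taken from the pod specs in the question and in this answer.
required = [[{"key": "vmSize", "operator": "In", "values": ["Standard_D4s_v3"]}]]
preferred = [{"weight": 1,
              "preference": [{"key": "hardware", "operator": "In", "values": ["cpu-spec"]}]}]

unlabelled_node: Dict[str, str] = {}  # assumption: a node with no custom labels
labelled_node = {"vmSize": "Standard_D4s_v3", "hardware": "cpu-spec"}

for name, labels in (("unlabelled", unlabelled_node), ("labelled", labelled_node)):
    print(name,
          "| passes required rule:", feasible(required, labels),
          "| passes with no required rule:", feasible([], labels),
          "| preference score:", preference_score(preferred, labels))
```

In this model, dropping the required rule means the affinity block can no longer exclude a node; it only affects ranking. The nodeSelector in the pod spec remains a separate hard constraint.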

Regarding "azure - Cluster autoscaler cannot scale up from 0 on Azure with ACS-Engine", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/52429472/
