
docker - kube-apiserver docker keeps restarting

Reposted · Author: 行者123 · Updated: 2023-12-02 20:54:04

My sincere apologies.
I have a 4-node Kubernetes cluster with 1 master and 3 worker nodes. I normally connect to the cluster using kubeconfig, but since yesterday I have been unable to connect. kubectl get pods gives the error message "The connection to the server api.xxxxx.xxxxxxxx.com was refused - did you specify the right host or port?"
The server name is specified as https://api.xxxxx.xxxxxxxx.com in the kubeconfig.
Note:
Because there are too many https links, I could not post the question as-is, so I have renamed https:// to https:-- in the Background Analysis section to avoid them being treated as links.
I tried running kubectl from the master node and got a similar error:
The connection to the server localhost:8080 was refused - did you specify the right host or port?
I then checked the kube-apiserver Docker container; it keeps exiting / going into CrashLoopBackOff. docker logs <container-id of kube-apiserver> shows the following error:

W0914 16:29:25.761524       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
F0914 16:29:29.319785       1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry {[https://127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true 0xc000266d80 apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err (context deadline exceeded)
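The x509 error above points at an expired certificate. A quick way to confirm expiry from the master is openssl's `-enddate`/`-checkend` flags. A minimal sketch, demonstrated on a throwaway self-signed cert so it runs anywhere; in practice you would point it at the path from the log, e.g. /etc/kubernetes/pki/kube-apiserver/etcd-client.crt:

```shell
# Generate a throwaway self-signed cert so this sketch is self-contained;
# substitute the real certificate path when diagnosing the cluster.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
  -out /tmp/demo.crt -days 1 -subj "/CN=demo" 2>/dev/null

# Print the expiry date:
openssl x509 -in /tmp/demo.crt -noout -enddate

# -checkend N exits 0 only if the cert is still valid N seconds from now:
if openssl x509 -in /tmp/demo.crt -noout -checkend 3600; then
  echo "still valid"
else
  echo "expired or about to expire"
fi
```

Running the same check across every .crt file under /etc/kubernetes/pki would show which certificates are affected.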

systemctl status kubelet gives the following error:

Sep 14 16:40:49 ip-xxx-xxx-xx-xx kubelet[2411]: E0914 16:40:49.693576    2411 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-xxx-xxx-xx-xx.xx-xxxxx-1.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-xxx-xxx-xx-xx.xx-xxxxx-1.compute.internal?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused


Note: ip-xxx-xx-xx-xxx is the internal IP address of the AWS EC2 instance.
Background analysis:
It looks like the cluster ran into some problem on 7 September 2020, and both the kube-controller and kube-scheduler Docker containers exited and restarted. I believe kube-apiserver has not been running since then, or these containers restarted because of kube-apiserver. The kube-apiserver server certificate had already expired in July 2020, yet access via kubectl kept working until 7 September.
Below are the docker logs from the exited kube-scheduler container:

I0907 10:35:08.970384       1 scheduler.go:572] pod default/k8version-1599474900-hrjcn is bound successfully on node ip-xx-xx-xx-xx.xx-xxxxxx-x.compute.internal, 4 nodes evaluated, 3 nodes were found feasible
I0907 10:40:09.286831       1 scheduler.go:572] pod default/k8version-1599475200-tshlx is bound successfully on node ip-1x-xx-xx-xx.xx-xxxxxx-x.compute.internal, 4 nodes evaluated, 3 nodes were found feasible
I0907 10:44:01.935373       1 leaderelection.go:263] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
E0907 10:44:01.935420       1 server.go:252] lost master
lost lease


Below are the docker logs from the exited kube-controller container:

I0907 10:40:19.703485       1 garbagecollector.go:518] delete object [v1/Pod, namespace: default, name: k8version-1599474300-5r6ph, uid: 67437201-f0f4-11ea-b612-0293e1aee720] with propagation policy Background
I0907 10:44:01.937398       1 leaderelection.go:263] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
E0907 10:44:01.937506       1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0907 10:44:01.937456       1 event.go:209] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"ba172d83-a302-11e9-b612-0293e1aee720", APIVersion:"v1", ResourceVersion:"85406287", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-xxx-xx-xx-xxx_1dd3c03b-bd90-11e9-85c6-0293e1aee720 stopped leading
F0907 10:44:01.937545       1 controllermanager.go:260] leaderelection lost
I0907 10:44:01.949274       1 range_allocator.go:169] Shutting down range CIDR allocator
I0907 10:44:01.949285       1 replica_set.go:194] Shutting down replicaset controller
I0907 10:44:01.949291       1 gc_controller.go:86] Shutting down GC controller
I0907 10:44:01.949304       1 pvc_protection_controller.go:111] Shutting down PVC protection controller
I0907 10:44:01.949310       1 route_controller.go:125] Shutting down route controller
I0907 10:44:01.949316       1 service_controller.go:197] Shutting down service controller
I0907 10:44:01.949327       1 deployment_controller.go:164] Shutting down deployment controller
I0907 10:44:01.949435       1 garbagecollector.go:148] Shutting down garbage collector controller
I0907 10:44:01.949443       1 resource_quota_controller.go:295] Shutting down resource quota controller


Below are the docker logs from kube-controller since the restart (7 September):

E0915 21:51:36.028108       1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:51:40.133446       1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused


Below are the docker logs from kube-scheduler since the restart (7 September):

E0915 21:52:44.703587       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.704504       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ReplicationController: Get https:--127.0.0.1/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.705471       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Get https:--127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.706477       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ReplicaSet: Get https:--127.0.0.1/apis/apps/v1/replicasets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.707581       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.StorageClass: Get https:--127.0.0.1/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.708599       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.PersistentVolume: Get https:--127.0.0.1/api/v1/persistentvolumes?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.709687       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.StatefulSet: Get https:--127.0.0.1/apis/apps/v1/statefulsets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.710744       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.PersistentVolumeClaim: Get https:--127.0.0.1/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.711879       1 reflector.go:126] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:223: Failed to list *v1.Pod: Get https:--127.0.0.1/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.712903       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.PodDisruptionBudget: Get https:--127.0.0.1/apis/policy/v1beta1/poddisruptionbudgets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused


kube-apiserver certificate renewal:
I found that the kube-apiserver certificate (i.e. /etc/kubernetes/pki/kube-apiserver/etcd-client.crt) had expired in July 2020. There were a few other expired certificates related to etcd-manager-main and events (identical copies of the same certificates in both places), but I did not see those referenced in the manifest files.
I searched and found steps to renew the certificates, but most of them use "kubeadm init phase" commands; kubeadm is not present on the master, and the certificate names and paths differ from my setup. So I used openssl to generate a new certificate for kube-apiserver from the existing CA cert, using an openssl.cnf file to include the DNS names together with the internal and external IP addresses (of the EC2 instance) and the loopback IP address. I replaced the certificate with the new one under the same name, /etc/kubernetes/pki/kube-apiserver/etcd-client.crt.
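The openssl flow described above can be sketched roughly as follows. This is a minimal, illustrative version only: the CA here is a freshly generated demo CA (in practice you would reuse the cluster's existing CA key and cert under /etc/kubernetes/pki), and all the /tmp file names and SAN entries are placeholders.

```shell
# Config with the SANs the cert needs (names/IPs here are placeholders).
cat > /tmp/openssl.cnf <<'EOF'
[req]
distinguished_name = dn
req_extensions = v3_req
[dn]
[v3_req]
subjectAltName = @alt_names
[alt_names]
DNS.1 = localhost
IP.1  = 127.0.0.1
EOF

# Demo CA so the sketch is self-contained; substitute the real CA key/cert.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/ca.key \
  -out /tmp/ca.crt -days 365 -subj "/CN=demo-ca" 2>/dev/null

# New private key + CSR for the component:
openssl req -newkey rsa:2048 -nodes -keyout /tmp/etcd-client.key \
  -out /tmp/etcd-client.csr -subj "/CN=etcd-client" \
  -config /tmp/openssl.cnf 2>/dev/null

# Sign the CSR with the CA, carrying the SANs over from the config:
openssl x509 -req -in /tmp/etcd-client.csr -CA /tmp/ca.crt -CAkey /tmp/ca.key \
  -CAcreateserial -days 365 -extensions v3_req -extfile /tmp/openssl.cnf \
  -out /tmp/etcd-client.crt 2>/dev/null

# Sanity check: the new cert should verify against the CA that signed it.
openssl verify -CAfile /tmp/ca.crt /tmp/etcd-client.crt
```

One caveat worth noting: etcd-client.crt is a *client* certificate that kube-apiserver presents to etcd, so it must be signed by a CA that etcd trusts, and etcd's own serving certificates must in turn be valid for the API server to accept them. Renewing only one side of that pair can still leave the API server crashing.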
After that I restarted the kube-apiserver container (which kept exiting) and restarted kubelet. The certificate-expiry message no longer appears, but kube-apiserver keeps restarting, and I believe that is the cause of the errors from the kube-controller and kube-scheduler containers.
Note:
I have not restarted Docker on the master after replacing the certificate.
Note: all our production pods run on the worker nodes, so they are unaffected, but I cannot manage them because I cannot connect with kubectl.
At this point I am not sure what the problem is or why kube-apiserver keeps restarting.
Update to the original question:
Kubernetes version: v1.14.1
Docker version: 18.6.3
Below are the latest docker logs from the kube-apiserver container (still crashing):

F0916 08:09:56.753538 1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry {[https:--127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true 0xc00095f050 apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err (tls: private key does not match public key)


Below is the output of systemctl status kubelet:

Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.095615 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x.compute.internal" not found


Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.130377 388 kubelet.go:2170] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR


Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.147390 388 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Get https:--127.0.0.1/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused


Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.195768 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x..compute.internal" not found


Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.295890 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x..compute.internal" not found


Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.347431 388 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://127.0.0.1/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused


This cluster (and 3 others) was set up with kops. The other clusters are running fine, and it looks like they also have some expired certificates. The person who set up the clusters is not available to comment, and my experience with Kubernetes is limited, so I need help from the experts.
Any kind of help is much appreciated.
Many thanks.
Update after the responses from Zambozo and Nepomucen:
Thanks to both of you for responding. Based on that, I found that the etcd certificates on the /mnt mount point had expired.
I followed the workaround from https://kops.sigs.k8s.io/advisories/etcd-manager-certificate-expiration/
and recreated the etcd certificates and keys. I verified each of them against a copy of the old certificates (from the backup folder) and everything matches; the new certificates' expiry date is set to September 2021.
Now I am getting a different error on the etcd containers (both etcd-manager-events and etcd-manager-main):
Note: xxx-xx-xx-xxx is the IP address of the master server.

root@ip-xxx-xx-xx-xxx:~# docker logs <etcd-manager-main container> --tail 20
I0916 14:41:40.349570    8221 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
W0916 14:41:40.351857    8221 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3996: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:41:40.351878    8221 peers.go:347] was not able to connect to peer etcd-a: map[xxx.xx.xx.xxx:3996:true]
W0916 14:41:40.351887    8221 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-a
I0916 14:41:41.205763    8221 controller.go:173] starting controller iteration
W0916 14:41:41.205801    8221 controller.go:149] unexpected error running etcd cluster reconciliation loop: cannot find self "etcd-a" in list of peers []
I0916 14:41:45.352008    8221 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
I0916 14:41:46.678314    8221 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0916 14:41:46.739272    8221 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0916 14:41:46.786653    8221 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-a.internal.xxxxx.xxxxxxx.com:[xxx.xx.xx.xxx xxx.xx.xx.xxx]], final=map[xxx.xx.xx.xxx:[etcd-a.internal.xxxxx.xxxxxxx.com etcd-a.internal.xxxxx.xxxxxxx.com]]
I0916 14:41:46.786724    8221 hosts.go:181] skipping update of unchanged /etc/hosts


root@ip-xxx-xx-xx-xxx:~# docker logs <etcd-manager-events container> --tail 20
W0916 14:42:40.294576    8316 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-events-a
I0916 14:42:41.106654    8316 controller.go:173] starting controller iteration
W0916 14:42:41.106692    8316 controller.go:149] unexpected error running etcd cluster reconciliation loop: cannot find self "etcd-events-a" in list of peers []
I0916 14:42:45.294682    8316 peers.go:281] connecting to peer "etcd-events-a" with TLS policy, servername="etcd-manager-server-etcd-events-a"
W0916 14:42:45.297094    8316 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3997: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:42:45.297117    8316 peers.go:347] was not able to connect to peer etcd-events-a: map[xxx.xx.xx.xxx:3997:true]
I0916 14:42:46.791923    8316 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0916 14:42:46.856548    8316 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0916 14:42:46.945119    8316 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-events-a.internal.xxxxx.xxxxxxx.com:[xxx.xx.xx.xxx xxx.xx.xx.xxx]], final=map[xxx.xx.xx.xxx:[etcd-events-a.internal.xxxxx.xxxxxxx.com etcd-events-a.internal.xxxxx.xxxxxxx.com]]
I0916 14:42:50.297264    8316 peers.go:281] connecting to peer "etcd-events-a" with TLS policy, servername="etcd-manager-server-etcd-events-a"
W0916 14:42:50.300328    8316 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3997: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:42:50.300348    8316 peers.go:347] was not able to connect to peer etcd-events-a: map[xxx.xx.xx.xxx:3997:true]
W0916 14:42:50.300360    8316 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-events-a


Could you suggest how to proceed from here?
Many thanks.

Best Answer

I think this is related to etcd. You may have renewed the certificates for the Kubernetes components, but did you do the same for etcd?
Your API server is trying to connect to etcd and reports:

tls: private key does not match public key
Since you have only 1 etcd (judging from the number of master nodes), I would back it up before trying to fix it.
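One way to confirm the "private key does not match public key" diagnosis before touching anything is to compare the public key embedded in the certificate with the public key derived from the private key: the two must be identical for the pair to work. A minimal sketch, run here against throwaway files so it is self-contained; substitute the real etcd-client.crt/.key paths from /etc/kubernetes/pki:

```shell
# Generate a matched cert/key pair plus an unrelated key for demonstration.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/match.key \
  -out /tmp/match.crt -days 1 -subj "/CN=demo" 2>/dev/null
openssl genrsa -out /tmp/other.key 2048 2>/dev/null

# Hash the public key from each side; equal hashes mean the pair matches.
cert_pub=$(openssl x509 -in /tmp/match.crt -noout -pubkey | openssl sha256)
good_pub=$(openssl pkey -in /tmp/match.key -pubout | openssl sha256)
bad_pub=$(openssl pkey -in /tmp/other.key -pubout | openssl sha256)

[ "$cert_pub" = "$good_pub" ] && echo "key matches cert"
[ "$cert_pub" != "$bad_pub" ] && echo "mismatched key detected"
```

In the situation described above, a mismatch would suggest that the newly generated etcd-client.crt was not signed for the key file kube-apiserver is actually loading, i.e. one of the .crt/.key files was replaced without its counterpart.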

Regarding "docker - kube-apiserver docker keeps restarting", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63910627/
