• Kubernetes deployment and operations: a running list of errors and fixes (updated from time to time)


    1. etcd: conflicting environment variable "ETCD_NAME"

    The etcd version installed here is 3.4; with etcd 3.3 or earlier the configuration below would not trigger this error.

    Checking etcd status reports: conflicting environment variable "ETCD_NAME" is shadowed by corresponding command-line flag (either unset environment variable or disable flag)

    Roughly: the environment variable ETCD_NAME conflicts with, and is shadowed by, the corresponding command-line flag; the fix is to either unset the environment variable or drop the flag.

    So where is this environment variable set?

    The startup unit file:

    [root@master bin]# cat /usr/lib/systemd/system/etcd.service
    [Unit]
    Description=Etcd Server
    After=network.target
    After=network-online.target
    Wants=network-online.target
    [Service]
    Type=notify
    EnvironmentFile=/opt/etcd/cfg/etcd.conf
    ExecStart=/opt/etcd/bin/etcd \
    --initial-advertise-peer-urls http://${THIS_IP}:${THIS_PORT_PEER}
    --listen-peer-urls http://${THIS_IP}:${THIS_PORT_PEER} \
    --advertise-client-urls http://${THIS_IP}:${THIS_PORT_API}
    --listen-client-urls http://${THIS_IP}:${THIS_PORT_API} \
    --initial-cluster ${CLUSTER} \
    --initial-cluster-state ${CLUSTER_STATE} --initial-cluster-token ${TOKEN}
    --cert-file=/opt/etcd/ssl/server.pem \
    --key-file=/opt/etcd/ssl/server-key.pem \
    --peer-cert-file=/opt/etcd/ssl/server.pem \
    --peer-key-file=/opt/etcd/ssl/server-key.pem \
    --trusted-ca-file=/opt/etcd/ssl/ca.pem \
    --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem
    Restart=on-failure
    LimitNOFILE=65536
    [Install]
    WantedBy=multi-user.target

    The etcd configuration file:

    [root@master bin]# cat /opt/etcd/cfg/etcd.conf
    #[Member]
    ETCD_NAME="etcd-1"
    ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
    ETCD_LISTEN_PEER_URLS="https://192.168.217.16:2380"
    ETCD_LISTEN_CLIENT_URLS="https://192.168.217.16:2379"
    #[Clustering]
    ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.217.16:2380"
    ETCD_ADVERTISE_CLIENT_URLS="https://192.168.217.16:2379"
    ETCD_INITIAL_CLUSTER="etcd-1=https://192.168.217.16:2380,etcd-2=https://192.168.217.17:2380,etcd-3=https://192.168.217.18:2380"
    ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
    ETCD_INITIAL_CLUSTER_STATE="new"

    The root cause: etcd 3.4 wants each option configured in exactly one place. The cluster/initial settings are defined both in the environment file and as command-line flags, so they conflict; etcd 3.3 and earlier tolerated the duplication. Removing the duplicated flags from the unit file is enough, i.e. these lines:

    --initial-advertise-peer-urls http://${THIS_IP}:${THIS_PORT_PEER}
    --listen-peer-urls http://${THIS_IP}:${THIS_PORT_PEER} \
    --advertise-client-urls http://${THIS_IP}:${THIS_PORT_API}
    --listen-client-urls http://${THIS_IP}:${THIS_PORT_API} \
    --initial-cluster ${CLUSTER} \
    --initial-cluster-state ${CLUSTER_STATE} --initial-cluster-token ${TOKEN}
    After deleting them, restart the service and etcd comes back up normally.

    [root@master bin]# systemctl status etcd
    ● etcd.service - Etcd Server
    Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
    Active: active (running) since Thu 2022-08-25 17:25:08 CST; 2h 31min ago
    Main PID: 3998 (etcd)
    Memory: 43.7M
    CGroup: /system.slice/etcd.service
    └─3998 /opt/etcd/bin/etcd --cert-file=/opt/etcd/ssl/server.pem --key-file=/opt/etcd/ssl/server-key.pem --peer-cert-file=/opt/etcd/ssl/server.pem --peer-key-file=/opt/etc...
    Aug 25 17:25:08 master etcd[3998]: raft2022/08/25 17:25:08 INFO: 1a58a86408898c44 became follower at term 81
    Aug 25 17:25:08 master etcd[3998]: raft2022/08/25 17:25:08 INFO: 1a58a86408898c44 [logterm: 1, index: 3, vote: 0] cast MsgVote for e078026890aff6e3 [logterm: 2, index: 5] at term 81
    Aug 25 17:25:08 master etcd[3998]: raft2022/08/25 17:25:08 INFO: raft.node: 1a58a86408898c44 elected leader e078026890aff6e3 at term 81
    Aug 25 17:25:08 master etcd[3998]: published {Name:etcd-1 ClientURLs:[https://192.168.217.16:2379]} to cluster e4c1916e49e5defc
    Aug 25 17:25:08 master etcd[3998]: ready to serve client requests
    Aug 25 17:25:08 master systemd[1]: Started Etcd Server.
    Aug 25 17:25:08 master etcd[3998]: serving client requests on 192.168.217.16:2379
    Aug 25 17:25:08 master etcd[3998]: set the initial cluster version to 3.4
    Aug 25 17:25:08 master etcd[3998]: enabled capabilities for version 3.4
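
    To double-check the cluster after the restart, an etcdctl health probe along these lines works (endpoints and certificate paths are taken from the configuration above; adjust them if yours differ):

    ETCDCTL_API=3 /opt/etcd/bin/etcdctl \
      --endpoints="https://192.168.217.16:2379,https://192.168.217.17:2379,https://192.168.217.18:2379" \
      --cacert=/opt/etcd/ssl/ca.pem \
      --cert=/opt/etcd/ssl/server.pem \
      --key=/opt/etcd/ssl/server-key.pem \
      endpoint health

    Each endpoint should report "is healthy".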

    2. The coredns pod stays in CrashLoopBackOff

    The pod log shows the error: /etc/coredns/Corefile:3 - Error during parsing: Unknown directive 'proxy'

    The relevant ConfigMap looks like this:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: coredns
      namespace: kube-system
    data:
      Corefile: |
        .:53 {
            errors
            log
            health
            kubernetes cluster.local 10.254.0.0/18
            proxy . /etc/resolv.conf
            cache 30
        }

    Change the second-to-last line of the server block to forward . /etc/resolv.conf (the proxy plugin was removed in newer CoreDNS releases), then delete the coredns pod so it is recreated; the error disappears and the pod returns to Running.
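
    For reference, the corrected Corefile block, with only that one line swapped:

    Corefile: |
      .:53 {
          errors
          log
          health
          kubernetes cluster.local 10.254.0.0/18
          forward . /etc/resolv.conf
          cache 30
      }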

    3. kubectl exec returns Forbidden

    Running kubectl exec -it <pod-name> fails with: error: unable to upgrade connection: Forbidden (user=k8s-apiserver, verb=create, resource=nodes, sub

    The cause is that the user named in the parentheses, k8s-apiserver here, lacks the required permission. The user in your error may be different; go by whatever follows user= in the message. Granting that user cluster-admin fixes it:

    kubectl create clusterrolebinding k8s-apiserver --clusterrole=cluster-admin --user=k8s-apiserver

    (cluster-admin is the built-in cluster administrator role. A variant like the one below is sometimes suggested, but a ClusterRole named system:admin does not exist in a stock cluster, so it only has effect if you have created such a role yourself; the cluster-admin binding above is the one that actually grants access.)

    kubectl create clusterrolebinding k8s-apiserver --clusterrole=system:admin --user=k8s-apiserver
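
    A quick way to confirm the problem and the fix is kubectl auth can-i with impersonation. The exact verb/resource/subresource should be copied from your own error message; nodes/proxy below is only an assumption based on the truncated text above:

    # before the binding this should print "no", afterwards "yes"
    kubectl auth can-i create nodes/proxy --as=k8s-apiserver
    # or, after binding cluster-admin, check everything at once
    kubectl auth can-i '*' '*' --as=k8s-apiserver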

    4. kubelet starts, looks fine for a minute or two, then stops

    The kubelet service shows as failed, and the system log /var/log/messages contains:

    F0827 15:18:26.995457   29538 server.go:274] failed to run Kubelet: misconfiguration: kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"

    In other words, the cgroup driver Docker is using does not match the one configured for kubelet, so kubelet refuses to run. (This only happens with binary installs; kubeadm aligns the two automatically.)

    kubelet's configuration file:

    kind: KubeletConfiguration
    apiVersion: kubelet.config.k8s.io/v1beta1
    address: 0.0.0.0
    port: 10250
    readOnlyPort: 10255
    cgroupDriver: cgroupfs
    clusterDNS:
    - 10.0.0.2

    Docker's configuration file:

    [root@slave1 ~]# cat /etc/docker/daemon.json
    {
      "registry-mirrors": ["http://bc437cce.m.daocloud.io"],
      "exec-opts":["native.cgroupdriver=systemd"],
      "log-driver": "json-file",
      "log-opts": {
        "max-size": "100m"
      },
      "storage-driver": "overlay2"
    }

    Change either side, as long as the two end up identical: for example, set Docker's config to "exec-opts":["native.cgroupdriver=cgroupfs"] and restart Docker, or set cgroupDriver: systemd in the kubelet config and restart kubelet. Consistency is all that matters.
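
    A quick way to check both sides before and after the change (the kubelet config path below is an assumption; use wherever your kubelet --config file actually lives):

    docker info 2>/dev/null | grep -i "cgroup driver"          # what Docker is really using
    grep cgroupDriver /opt/kubernetes/cfg/kubelet-config.yml   # assumed path to the KubeletConfiguration shown above
    systemctl restart docker kubelet                           # restart whichever side you changed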

    5. kube-flannel pods flap between Running and CrashLoopBackOff

    Symptom:

    The cluster looks fine and the nodes look fine, but the kube-flannel pods keep bouncing between Running and CrashLoopBackOff.

    [root@master ~]# k get po -A -owide
    NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    default busybox-7bf6d6f9b5-jg922 1/1 Running 2 23h 10.244.0.12 k8s-master
    default dns-test 0/1 Error 1 27h k8s-node1
    default nginx-7c96855774-28b5w 1/1 Running 2 29h 10.244.0.11 k8s-master
    default nginx-7c96855774-4b5vg 0/1 Completed 1 29h k8s-node1
    default nginx1 0/1 Error 1 27h k8s-node2
    kube-system coredns-76648cbfc9-lb75g 0/1 Completed 1 24h k8s-node2
    kube-system kube-flannel-ds-mhkdq 0/1 CrashLoopBackOff 11 29h 192.168.217.17 k8s-node1
    kube-system kube-flannel-ds-mlb7l 0/1 CrashLoopBackOff 11 29h 192.168.217.18 k8s-node2
    kube-system kube-flannel-ds-sl4qv 1/1 Running 4 29h 192.168.217.16 k8s-master

    Approach:

    Everything on the master is fine, so some service must differ between the master and the worker nodes. Checking the key services shows kube-proxy was not running on the worker nodes. Start it, then delete the unhealthy pods (or just wait; they recover on their own, it simply takes longer):

    Run on node1 and node2:

    systemctl start kube-proxy

    Back on the master, just wait: the flannel pods restart by themselves and the restart count ticks up by one to 12:

    [root@master ~]# k get po -A -owide
    NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    default busybox-7bf6d6f9b5-jg922 1/1 Running 2 23h 10.244.0.12 k8s-master
    default dns-test 0/1 ImagePullBackOff 1 27h 10.244.1.7 k8s-node1
    default nginx-7c96855774-28b5w 1/1 Running 2 29h 10.244.0.11 k8s-master
    default nginx-7c96855774-4b5vg 1/1 Running 2 29h 10.244.1.6 k8s-node1
    default nginx1 1/1 Running 2 27h 10.244.2.10 k8s-node2
    kube-system coredns-76648cbfc9-lb75g 1/1 Running 2 24h 10.244.2.11 k8s-node2
    kube-system kube-flannel-ds-mhkdq 1/1 Running 12 30h 192.168.217.17 k8s-node1
    kube-system kube-flannel-ds-mlb7l 1/1 Running 12 30h 192.168.217.18 k8s-node2
    kube-system kube-flannel-ds-sl4qv 1/1 Running 4 30h 192.168.217.16 k8s-master
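
    To keep this from coming back after a reboot, it is also worth enabling the service on the worker nodes (assuming kube-proxy runs under systemd, as it does in this binary install):

    systemctl enable --now kube-proxy
    systemctl is-enabled kube-proxy   # should print "enabled"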

    6. Dashboard: token login works, kubeconfig login does not

    Symptom:

    After the cluster was built, logging in to the dashboard with a token works, but logging in with a config file does not. The dashboard shows, top right: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "kubelet-bootstrap" cannot list resouces, and no resources are displayed.

    Fix:

    Check the clusterrolebinding:

    [root@master ~]# kubectl get clusterrolebindings kubelet-bootstrap
    NAME ROLE AGE
    kubelet-bootstrap ClusterRole/system:node-bootstrapper 16s

    The user is bound to a non-admin role, so bind it to cluster-admin instead:

    [root@master ~]# kubectl delete clusterrolebindings kubelet-bootstrap
    clusterrolebinding.rbac.authorization.k8s.io "kubelet-bootstrap" deleted
    [root@master ~]# kubectl create clusterrolebinding kubelet-bootstrap --clusterrole=cluster-admin --user=kubelet-bootstrap
    clusterrolebinding.rbac.authorization.k8s.io/kubelet-bootstrap created

    Check again:

    [root@master ~]# kubectl get clusterrolebindings kubelet-bootstrap
    NAME ROLE AGE
    kubelet-bootstrap ClusterRole/cluster-admin 4m38s

    In short: when a user name shows up in a permission error, check what that user is bound to first; if it is not an admin-level role, granting the right role solves the problem.
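
    A convenient way to do that first check is kubectl auth can-i --list with impersonation, which prints what the user is actually allowed to do, plus kubectl describe to see which ClusterRole a binding points at:

    kubectl auth can-i --list --as=kubelet-bootstrap
    kubectl describe clusterrolebinding kubelet-bootstrap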

    7. Creating an Ingress fails: the ingress-nginx admission webhook returns EOF

    Symptom:

    An ingress-nginx test fails. The test manifest:

    [root@master ~]# cat ingress-nginx.yaml
    apiVersion: networking.k8s.io/v1beta1
    kind: Ingress
    metadata:
      annotations:
        kubernetes.io/ingress.class: "nginx"
      name: example
    spec:
      rules:                      # one Ingress can carry multiple rules
      - host: foo.bar.com         # host is optional; leave it out to match *, or use e.g. *.bar.com
        http:
          paths:                  # like nginx location blocks; one host can have several paths
          - backend:
              serviceName: svc-demo   # the Service to route to
              servicePort: 8080       # the Service port
            path: /

    Applying it fails:

    [root@master ~]# k apply -f ingress-nginx.yaml
    Error from server (InternalError): error when creating "ingress-nginx.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://ingress3-ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1beta1/ingresses?timeout=10s: EOF

    Fix:

    [root@master ~]# kubectl get validatingwebhookconfigurations
    NAME WEBHOOKS AGE
    ingress3-ingress-nginx-admission 1 43m
    [root@master ~]# kubectl delete -A ValidatingWebhookConfiguration ingress3-ingress-nginx-admission
    validatingwebhookconfiguration.admissionregistration.k8s.io "ingress3-ingress-nginx-admission" deleted

    List the validating webhooks, delete the offending one, and re-apply the test manifest; it then succeeds.
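
    Deleting the webhook works, but it also switches off ingress validation. A gentler first check is whether the admission Service simply has no ready endpoints, i.e. the controller pod is not actually up (names taken from the error message above):

    kubectl -n ingress-nginx get pods
    kubectl -n ingress-nginx get svc,endpoints | grep admission
    kubectl get validatingwebhookconfigurations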

    8. helm3: Kubernetes cluster unreachable

    Symptom: helm3 is unusable; even helm list fails:

    [root@master ~]# helm list
    Error: Kubernetes cluster unreachable
    [root@master ~]# helm repo list
    Error: no repositories to show

    Fix:

    Set the KUBECONFIG environment variable to your kubeconfig file. Mine lives at /opt/kubernetes/cfg/bootstrap.kubeconfig, so:

    export KUBECONFIG=/opt/kubernetes/cfg/bootstrap.kubeconfig

    With the variable set, helm works normally.
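
    The export only lasts for the current shell; to make it stick, append it to the shell profile (path taken from above):

    echo 'export KUBECONFIG=/opt/kubernetes/cfg/bootstrap.kubeconfig' >> ~/.bash_profile
    source ~/.bash_profile
    helm list   # should now reach the cluster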

    9. A worker node is missing from kubectl get nodes

    Symptom: on the master, kubectl get nodes shows only node2; worker node1 never appears:

    [root@master cfg]# k get no -owide
    NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
    k8s-master NotReady 60m v1.18.3 192.168.217.16 CentOS Linux 7 (Core) 5.16.9-1.el7.elrepo.x86_64 docker://20.10.7
    k8s-node2 NotReady 17m v1.18.3 192.168.217.18 CentOS Linux 7 (Core) 5.16.9-1.el7.elrepo.x86_64 docker://20.10.7

    On node1, the system log shows:

    Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.037318 21865 kubelet.go:2267] node "k8s-node1" not found
    Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.138272 21865 kubelet.go:2267] node "k8s-node1" not found
    Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.239285 21865 kubelet.go:2267] node "k8s-node1" not found
    Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.340365 21865 kubelet.go:2267] node "k8s-node1" not found
    Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.441356 21865 kubelet.go:2267] node "k8s-node1" not found
    Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.542351 21865 kubelet.go:2267] node "k8s-node1" not found
    Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.643332 21865 kubelet.go:2267] node "k8s-node1" not found
    Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.744277 21865 kubelet.go:2267] node "k8s-node1" not found
    Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.845217 21865 kubelet.go:2267] node "k8s-node1" not found
    Aug 30 12:21:05 slave1 kubelet: E0830 12:21:05.946301 21865 kubelet.go:2267] node "k8s-node1" not found
    Aug 30 12:21:06 slave1 kubelet: E0830 12:21:06.047337 21865 kubelet.go:2267] node "k8s-node1" not found
    Aug 30 12:21:06 slave1 kubelet: E0830 12:21:06.593145 21865 controller.go:136] failed to ensure node lease exists, will retry in 7s, error: leases.coordination.k8s.io "k8s-node1" is forbidden: User "system:node:k8s-node2" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-node-lease": can only access node lease with the same name as the requesting node

    Fix:

    The log points at kubelet: its configuration carries the wrong node name. It should have registered as k8s-node1 but was copied over with k8s-node2, a name node2 had already registered successfully (hence the "can only access node lease with the same name as the requesting node" error above).

    Concretely: correct the name in the kubelet configuration on node1, delete the stale client certificate kubelet-client-current.pem, and restart kubelet. The master can then list the node again:

    [root@master cfg]# k get no -owide
    NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
    k8s-master NotReady 64m v1.18.3 192.168.217.16 CentOS Linux 7 (Core) 5.16.9-1.el7.elrepo.x86_64 docker://20.10.7
    k8s-node1 NotReady 9s v1.18.3 192.168.217.17 CentOS Linux 7 (Core) 5.16.9-1.el7.elrepo.x86_64 docker://20.10.7
    k8s-node2 NotReady 20m v1.18.3 192.168.217.18 CentOS Linux 7 (Core) 5.16.9-1.el7.elrepo.x86_64 docker://20.10.7
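
    For reference, the steps on node1 look roughly like this; the kubelet config path and file name are assumptions for a binary install such as this one, so adjust them to your layout:

    # on node1: fix the node name in the kubelet config (assumed path), drop the stale cert, restart
    sed -i 's/k8s-node2/k8s-node1/g' /opt/kubernetes/cfg/kubelet.conf
    rm -f /opt/kubernetes/ssl/kubelet-client-current.pem
    systemctl restart kubelet

    # on the master: approve the node's new CSR if your cluster does not auto-approve
    kubectl get csr
    kubectl certificate approve <csr-name>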

    10. Creating a PV fails: nodeAffinity is immutable

    Symptom:

    Creating the PV reports:

    [root@master mysql]# k apply -f mysql-pv.yaml
    The PersistentVolume "mysql-pv" is invalid: nodeAffinity: Invalid value: core.VolumeNodeAffinity{Required:(*core.NodeSelector)(0xc002d9bf20)}: field is immutable

    The PV is stuck in the wrong state (Available instead of Bound):

    [root@master mysql]# k get pv
    NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
    mysql-pv 15Gi RWO Delete Available local-storage 12m

    Analysis:

    The PV manifest:

    [root@master mysql]# cat mysql-pv.yaml
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: mysql-pv
    spec:
      capacity:
        storage: 15Gi
      volumeMode: Filesystem
      accessModes:
      - ReadWriteOnce
      persistentVolumeReclaimPolicy: Delete
      storageClassName: local-storage
      local:                          # this is a Local Persistent Volume
        path: /mnt/mysql-data         # the local disk path backing the PV
      nodeAffinity:                   # node affinity
        required:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - k8s-node1             # must be placed on node1

    The values entry had been edited after the PV was first created. In other words the PV already existed, and nodeAffinity on an existing PV is immutable, so re-applying the manifest with a different value is rejected.

    Fix:

    Delete the old PV and apply the manifest again; the PV and PVC then both look healthy:

    [root@master mysql]# k get pv
    NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
    mysql-pv 15Gi RWO Delete Bound default/mysql-pv-claim local-storage 9m5s
    [root@master mysql]# k get pvc
    NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
    mysql-pv-claim Bound mysql-pv 15Gi RWO local-storage 57m
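
    The delete-and-recreate sequence, for completeness:

    kubectl delete pv mysql-pv
    kubectl apply -f mysql-pv.yaml
    kubectl get pv,pvc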

    11. Creating an Ingress fails: admission webhook context deadline exceeded

    Symptom: creating the Ingress resource fails with:

    [root@master ~]# k apply -f ingress-http.yaml
    Error from server (InternalError): error when creating "ingress-http.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1beta1/ingresses?timeout=10s: context deadline exceeded

    vim ingress-http.yaml

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: ingress-http
      namespace: dev
      annotations:
        nginx.ingress.kubernetes.io/rewrite-target: /
    spec:
      rules:
      - host: nginx.test.com
        http:
          paths:
          - path: /
            backend:
              serviceName: nginx-service
              servicePort: 80
      - host: tomcat.test.com
        http:
          paths:
          - path: /
            backend:
              serviceName: tomcat-service
              servicePort: 80

    Analysis: thinking back, the backend port in this file had been edited (8080 at first, later changed to 80), but the Ingress could not simply be re-applied because creation goes through the validating webhook that ingress-nginx installs implicitly, and that webhook was not responding. The workaround is to delete the ValidatingWebhookConfiguration.

    Fix:

    First list the ValidatingWebhookConfigurations:

    kubectl get ValidatingWebhookConfiguration

    Then delete the offending one:

    kubectl delete -A ValidatingWebhookConfiguration ingress-nginx-admission

    12. Pods will not start; kube-proxy and kube-controller-manager both throw errors

    Symptom:

    Pods fail to start, kube-proxy and kube-controller-manager both report errors, and the system log is a sea of red.

    kube-proxy / kube-controller-manager on the master:

    Oct  4 19:11:00 master kube-controller-manager: E1004 19:11:00.930275   22282 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
    

    kubelet log on node2:

    Oct  4 19:11:10 node2 kubelet: E1004 19:11:10.285790   31170 pod_workers.go:191] Error syncing pod 84a93201-5bee-4a40-85c1-b581c1faefa7 ("calico-kube-controllers-57546b46d6-sf26n_kube-system(84a93201-5bee-4a40-85c1-b581c1faefa7)"), skipping: failed to "KillPodSandbox" for "84a93201-5bee-4a40-85c1-b581c1faefa7" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"calico-kube-controllers-57546b46d6-sf26n_kube-system\" network: error getting ClusterInformation: Get https://[10.0.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.0.0.1:443: i/o timeout"
     

    kube-proxy log on node2:

    E1004 18:54:49.280262   16325 reflector.go:382] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.Service: Get https://192.168.217.16:6443/api/v1/services?allowWatchBookmarks=true&labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=64344&timeout=6m50s&timeoutSeconds=410&watch=true: dial tcp 192.168.217.16:6443: connect: connection refused
    E1004 18:54:49.280330   16325 reflector.go:382] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.Endpoints: Get https://192.168.217.16:6443/api/v1/endpoints?allowWatchBookmarks=true&labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=64886&timeout=8m58s&timeoutSeconds=538&watch=true: dial tcp 192.168.217.16:6443: connect: connection r

    Nothing is healthy: calico-node keeps restarting, and only the pods on the master are fine; everything else is not:

    [root@k8s-master cfg]# k get po -A -owide
    NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    kube-system calico-kube-controllers-57546b46d6-sf26n 0/1 ContainerCreating 0 12m k8s-node2
    kube-system calico-node-fskfk 0/1 CrashLoopBackOff 7 12m 192.168.217.18 k8s-node2
    kube-system calico-node-gbv9d 1/1 Running 0 12m 192.168.217.16 k8s-master
    kube-system calico-node-vb88h 0/1 Error 7 12m 192.168.217.17 k8s-node1
    kube-system coredns-76648cbfc9-8f45v 1/1 Running 2 3h36m 10.244.235.193 k8s-master

    Events from one of the failing pods:

      ----     ------                  ----       ----                -------
      Normal   Scheduled                default-scheduler   Successfully assigned kube-system/calico-kube-controllers-57546b46d6-sf26n to k8s-node2
      Warning  FailedCreatePodSandBox  1s         kubelet, k8s-node2  Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "0ca9dd9a4cf391dec3163d80e50fcb6b6424c7d93f245dc5b4f011eefed53375" network for pod "calico-kube-controllers-57546b46d6-sf26n": networkPlugin cni failed to set up pod "calico-kube-controllers-57546b46d6-sf26n_kube-system" network: error getting ClusterInformation: Get https://[10.0.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.0.0.1:443: i/o timeout, failed to clean up sandbox container "0ca9dd9a4cf391dec3163d80e50fcb6b6424c7d93f245dc5b4f011eefed53375" network for pod "calico-kube-controllers-57546b46d6-sf26n": networkPlugin cni failed to teardown pod "calico-kube-controllers-57546b46d6-sf26n_kube-system" network: error getting ClusterInformation: Get https://[10.0.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.0.0.1:443: i/o timeout]
      Normal   SandboxChanged          0s         kubelet, k8s-node2  Pod sandbox changed, it will be killed and re-created.

     

    Troubleshooting:

    On node2, telnet 10.0.0.1 443 indeed fails. The firewall is confirmed off and the apiserver looks basically fine, yet restarting every service on every node changes nothing.

    Combing through each service's configuration file finally turns up the problem:

    [root@k8s-master cfg]# grep -r -i "10.244" ./
    ./calico.yaml: value: "10.244.0.0/16"
    ./kube-flannel.yml: "Network": "10.244.0.0/16",
    ./kube-controller-manager.conf:--cluster-cidr=10.244.0.0/16 \
    ./kube-proxy-config.yml:clusterCIDR: 10.0.0.0/16

    kube-proxy's clusterCIDR is 10.0.0.0/16, which does not match the controller-manager's cluster-cidr of 10.244.0.0/16. Fix kube-proxy-config.yml and restart kube-proxy and kubelet (restart commands are sketched after the listing below); a single wrong value caused all of this:

    [root@k8s-master cfg]# grep -r -i "10.244" ./
    ./calico.yaml: value: "10.244.0.0/16"
    ./kube-flannel.yml: "Network": "10.244.0.0/16",
    ./kube-controller-manager.conf:--cluster-cidr=10.244.0.0/16 \
    ./kube-proxy-config.yml:clusterCIDR: 10.244.0.0/16
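
    The restart after the edit, run on every node that carries kube-proxy and kubelet:

    systemctl restart kube-proxy kubelet
    systemctl status kube-proxy kubelet --no-pager | grep Active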

    The logs are clean again and the pods recover:

    [root@k8s-master cfg]# k get po -A -owide
    NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    kube-system calico-kube-controllers-57546b46d6-9xt7k 1/1 Running 0 32m 10.244.36.66 k8s-node1
    kube-system calico-node-5tzdt 1/1 Running 0 40m 192.168.217.16 k8s-master
    kube-system calico-node-pllx6 1/1 Running 4 40m 192.168.217.17 k8s-node1
    kube-system calico-node-tpjc9 1/1 Running 4 40m 192.168.217.18 k8s-node2
    kube-system coredns-76648cbfc9-8f45v 1/1 Running 2 4h18m 10.244.235.193 k8s-master

    Peace and quiet at last!

    13. kubectl top node fails after deploying Metrics Server

    Symptom:

    After deploying Metrics Server, the pod starts normally but kubectl top node cannot fetch anything:

    [root@k8s-master kis]# k top node
    Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)

    kube-controller-manager reports:

    [root@k8s-master kis]# systemctl status kube-controller-manager.service -l
    ● kube-controller-manager.service - Kubernetes Controller Manager
    Loaded: loaded (/usr/lib/systemd/system/kube-controller-manager.service; enabled; vendor preset: disabled)
    Active: active (running) since Tue 2022-10-04 22:01:55 CST; 35min ago
    Docs: https://github.com/kubernetes/kubernetes
    Main PID: 757 (kube-controller)
    Memory: 115.1M
    CGroup: /system.slice/kube-controller-manager.service
    └─757 /opt/kubernetes/bin/kube-controller-manager --logtostderr=false --v=2 --log-dir=/opt/kubernetes/logs --leader-elect=true --master=127.0.0.1:8080 --bind-address=127.0.0.1 --allocate-node-cidrs=true --cluster-cidr=10.244.0.0/16 --service-cluster-ip-range=10.0.0.0/16 --cluster-signing-cert-file=/opt/kubernetes/ssl/ca.pem --cluster-signing-key-file=/opt/kubernetes/ssl/ca-key.pem --root-ca-file=/opt/kubernetes/ssl/ca.pem --service-account-private-key-file=/opt/kubernetes/ssl/ca-key.pem --experimental-cluster-signing-duration=87600h0m0s
    Oct 04 22:32:23 k8s-master kube-controller-manager[757]: E1004 22:32:23.089005 757 resource_quota_controller.go:408] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    Oct 04 22:32:53 k8s-master kube-controller-manager[757]: E1004 22:32:53.842126 757 resource_quota_controller.go:408] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

    Several other services log similar errors; the system log is once again a wall of red:

    Oct 4 23:20:03 master kube-apiserver: E1004 23:20:03.661075 27870 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
    Oct 4 23:20:03 master kubelet: E1004 23:20:03.665400 1343 cni.go:385] Error deleting kube-system_calico-kube-controllers-57546b46d6-zq5ds/a3070b42be0a83c747fcb5fc0e9b8332ee8258c7d4fbde654fc2025e66b98502 from network calico/k8s-pod-network: error getting ClusterInformation: connection is unauthorized: Unauthorized
    Oct 4 23:20:03 master kubelet: E1004 23:20:03.666450 1343 remote_runtime.go:128] StopPodSandbox "a3070b42be0a83c747fcb5fc0e9b8332ee8258c7d4fbde654fc2025e66b98502" from runtime service failed: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "calico-kube-controllers-57546b46d6-zq5ds_kube-system" network: error getting ClusterInformation: connection is unauthorized: Unauthorized
    Oct 4 23:20:03 master kubelet: E1004 23:20:03.666506 1343 kuberuntime_manager.go:895] Failed to stop sandbox {"docker" "a3070b42be0a83c747fcb5fc0e9b8332ee8258c7d4fbde654fc2025e66b98502"}
    Oct 4 23:20:03 master kubelet: E1004 23:20:03.666567 1343 kuberuntime_manager.go:674] killPodWithSyncResult failed: failed to "KillPodSandbox" for "556b1eeb-27a4-4b3d-bbc5-6bc9b172dced" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"calico-kube-controllers-57546b46d6-zq5ds_kube-system\" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"
    Oct 4 23:20:03 master kubelet: E1004 23:20:03.666598 1343 pod_workers.go:191] Error syncing pod 556b1eeb-27a4-4b3d-bbc5-6bc9b172dced ("calico-kube-controllers-57546b46d6-zq5ds_kube-system(556b1eeb-27a4-4b3d-bbc5-6bc9b172dced)"), skipping: failed to "KillPodSandbox" for "556b1eeb-27a4-4b3d-bbc5-6bc9b172dced" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"calico-kube-controllers-57546b46d6-zq5ds_kube-system\" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"
    Oct 4 23:20:04 master dockerd: time="2022-10-04T23:20:04.082579376+08:00" level=info msg="shim reaped" id=82ad6e61e556d31761bef3ebb390519e747baf37a5bb10c614cac79746ec1600
    Oct 4 23:20:04 master dockerd: time="2022-10-04T23:20:04.093751545+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
    Oct 4 23:20:04 master dockerd: time="2022-10-04T23:20:04.663889369+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/25bd5dce993e46c58af335e91dc01e5fabf9c695ee1fb3a48c669c181e243878/shim.sock" debug=false pid=8400
    Oct 4 23:20:04 master systemd: Started libcontainer container 25bd5dce993e46c58af335e91dc01e5fabf9c695ee1fb3a48c669c181e243878.
    Oct 4 23:20:04 master systemd: Starting libcontainer container 25bd5dce993e46c58af335e91dc01e5fabf9c695ee1fb3a48c669c181e243878.
    Oct 4 23:20:04 master dockerd: time="2022-10-04T23:20:04.929696386+08:00" level=info msg="shim reaped" id=25bd5dce993e46c58af335e91dc01e5fabf9c695ee1fb3a48c669c181e243878
    Oct 4 23:20:04 master dockerd: time="2022-10-04T23:20:04.940353322+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
    Oct 4 23:20:05 master dockerd: time="2022-10-04T23:20:05.689201068+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/06f737ae5e9666987688a77122c768679ce4708d8bd21a718633f9720f8a0146/shim.sock" debug=false pid=8470
    Oct 4 23:20:05 master systemd: Started libcontainer container 06f737ae5e9666987688a77122c768679ce4708d8bd21a718633f9720f8a0146.
    Oct 4 23:20:05 master systemd: Starting libcontainer container 06f737ae5e9666987688a77122c768679ce4708d8bd21a718633f9720f8a0146.
    Oct 4 23:20:10 master kube-apiserver: E1004 23:20:10.456368 27870 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.0.158.133:443/apis/metrics.k8s.io/v1beta1: Get https://10.0.158.133:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

    Metrics Server runs on node2, and its pod IP cannot be pinged from the master.

    Fix:

    The logs point squarely at the calico network plugin: from the master, pod IPs on the other nodes are unreachable. Checking the calico manifest shows it was deployed with the vxlan backend:

    typha_service_name: "none"
    # Configure the backend to use.
    calico_backend: "vxlan"

    Changing calico_backend to bird and redeploying fixes it. (Calico's BGP mode can likewise leave pods on different nodes unable to reach one another if peering is broken; calico's CrossSubnet mode can also resolve this, or as a stopgap Metrics Server can be scheduled onto the master alongside the apiserver. I tried the latter and it works, though it is not a long-term answer.)
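
    A minimal sketch of that change, assuming the manifest is the calico.yaml referenced earlier (a full switch between VXLAN and BGP may also need matching changes to the IP pool encapsulation settings in your manifest):

    sed -i 's/calico_backend: "vxlan"/calico_backend: "bird"/' calico.yaml
    kubectl apply -f calico.yaml
    kubectl -n kube-system rollout restart daemonset calico-node   # pick up the new backend
    kubectl -n kube-system get pods -o wide | grep calico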

    14. The etcd cluster feels sluggish

    Background:

    The etcd cluster feels laggy and every operation against the Kubernetes cluster stutters. While chasing a different problem, a streak of red shows up in the system log:

    etcd: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:6

    A closely related slice of the log (taken on 192.168.217.20, currently the leader):

    Nov 1 22:55:27 master2 etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 43.143421ms, to f5b8cb45a0dcf520)
    Nov 1 22:55:27 master2 etcd: server is likely overloaded
    Nov 1 22:55:27 master2 etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 43.235065ms, to 3d70d11f824a5d8f)
    Nov 1 22:55:27 master2 etcd: server is likely overloaded
    Nov 1 22:55:36 master2 etcd: read-only range request "key:\"/registry/leases/kube-system/kube-scheduler\" " with result "range_response_count:1 size:483" took too long (102.762895ms) to execute
    Nov 1 22:55:42 master2 etcd: request "header: txn: success:> failure:<>>" with result "size:18" took too long (122.623655ms) to execute
    Nov 1 22:55:42 master2 etcd: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:6" took too long (103.679383ms) to execute

    Analysis:

    The key line is this one; it says the leader failed to send a heartbeat within the 100ms window to member f5b8cb45a0dcf520, which the member list below identifies as etcd-3 at 192.168.217.21:

    master2 etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 43.143421ms, to f5b8cb45a0dcf520

    [root@master2 ~]# etct_serch member list -w table
    +------------------+---------+--------+-----------------------------+-----------------------------+------------+
    | ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
    +------------------+---------+--------+-----------------------------+-----------------------------+------------+
    | 3d70d11f824a5d8f | started | etcd-1 | https://192.168.217.19:2380 | https://192.168.217.19:2379 | false |
    | ef2fee107aafca91 | started | etcd-2 | https://192.168.217.20:2380 | https://192.168.217.20:2379 | false |
    | f5b8cb45a0dcf520 | started | etcd-3 | https://192.168.217.21:2380 | https://192.168.217.21:2379 | false |
    +------------------+---------+--------+-----------------------------+-----------------------------+------------+

    So the root cause is that the etcd heartbeat interval and election timeout are set too low for this environment.

    Fix:

    Edit the etcd unit file on all three nodes (the defaults are heartbeat-interval=100ms and election-timeout=1000ms) and restart every etcd member afterwards:

    [root@master2 ~]# cat /usr/lib/systemd/system/etcd.service
    [Unit]
    Description=Etcd Server
    After=network.target
    After=network-online.target
    Wants=network-online.target
    [Service]
    Type=notify
    EnvironmentFile=/opt/etcd/cfg/etcd.conf
    ExecStart=/opt/etcd/bin/etcd \
    --cert-file=/opt/etcd/ssl/server.pem \
    --key-file=/opt/etcd/ssl/server-key.pem \
    --peer-cert-file=/opt/etcd/ssl/server.pem \
    --peer-key-file=/opt/etcd/ssl/server-key.pem \
    --trusted-ca-file=/opt/etcd/ssl/ca.pem \
    --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem \
    --wal-dir=/var/lib/etcd \
    --snapshot-count=50000 \
    --auto-compaction-retention=1 \
    --auto-compaction-mode=periodic \
    --max-request-bytes=10485760 \
    --quota-backend-bytes=8589934592 \
    --heartbeat-interval="5000" \
    --election-timeout="25000"
    Restart=on-failure
    LimitNOFILE=65536
    [Install]
    WantedBy=multi-user.target

    Notes on the tuning flags (systemd does not allow inline comments inside ExecStart, and it does not evaluate shell arithmetic such as $((10*1024*1024)), so the byte counts are written out literally above):

    --wal-dir: directory for the write-ahead log.
    --snapshot-count: number of committed transactions before a snapshot is written to disk and the WAL is released; the default is 100000.
    --auto-compaction-retention=1 with --auto-compaction-mode=periodic: first compaction after 1 hour, then every 10% of that period, i.e. roughly every 6 minutes.
    --max-request-bytes: maximum request size (10MB here); a single key defaults to 1.5MB and the official recommendation tops out at 10MB.
    --quota-backend-bytes: backend storage quota (8GB here).
    --heartbeat-interval / --election-timeout: raised from the defaults of 100ms and 1000ms.

    A short summary:

    In this deployment etcd and the apiserver share the same three master nodes, which is the real root of the problem: they compete for the same network and disk. In production, run etcd on dedicated nodes rather than co-located with the apiserver; with that, the default heartbeat and election values are normally sufficient.

    Network stutter should be fixed at its source; tuning these parameters is only a stopgap. If you do set them, keeping the election timeout about five times the heartbeat interval is enough.
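
    To see whether etcd keeps up after the change, etcd 3.4's built-in performance check can be run against the cluster (endpoints from the member list above, certificate paths from the unit file):

    ETCDCTL_API=3 /opt/etcd/bin/etcdctl \
      --endpoints="https://192.168.217.19:2379,https://192.168.217.20:2379,https://192.168.217.21:2379" \
      --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem \
      check perf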

    15. flannel: node "node3" pod cidr not assigned (minikube)

    Background:

    Installing the flannel plugin on a minikube cluster fails; the pod never reaches Running:

    Error registering network: failed to acquire lease: node "node3" pod cidr not assigned

    Analysis:

    The error means no pod CIDR has been assigned to the node, so flannel cannot acquire a lease. minikube uses kubeadm under the hood, so its static pod manifests live in /etc/kubernetes/manifests just as with kubeadm. Checking the kube-controller-manager manifest there shows no pod CIDR defined at all, while the flannel deployment uses the default 10.244.0.0/16.

    Strangely, the start command did specify a CIDR, yet it never made it into the manifest.

    The original start command:

    minikube start \
    --extra-config=controller-manager.allocate-node-cidrs=true \
    --extra-config=controller-manager.cluster-cidr=10.244.0.0/16 \
    --extra-config=kubelet.network-plugin=cni \
    --extra-config=kubelet.pod-cidr=10.244.0.0/16 \
    --network-plugin=cni \
    --kubernetes-version=1.18.8 \
    --vm-driver=none

    Fix:

    Edit the kube-controller-manager manifest under /etc/kubernetes/manifests and add these three flags:

    - --allocate-node-cidrs=true
    - --cluster-cidr=10.244.0.0/16
    - --service-cluster-ip-range=10.96.0.0/12
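
    Static pods restart on their own once the manifest changes; a quick check that the flags were picked up (pod name taken from the listing further below):

    kubectl -n kube-system get pod kube-controller-manager-node3 -o yaml | grep -E 'allocate-node-cidrs|cluster-cidr'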

    On closer inspection the start command above was also flawed; a working invocation looks like the following, and it installs flannel itself so no separate YAML is needed:

    minikube start --pod-network-cidr='10.244.0.0/16' \
    --extra-config=kubelet.pod-cidr=10.244.0.0/16 \
    --network-plugin=cni \
    --image-repository='registry.aliyuncs.com/google_containers' \
    --cni=flannel \
    --apiserver-ips=192.168.217.23 \
    --kubernetes-version=1.18.8 \
    --vm-driver=none

    After a short wait, redeploy flannel and it comes back healthy:

    [root@node3 manifests]# kubectl get po -A -owide
    NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    kube-system coredns-66bff467f8-9glkl 0/1 Running 0 58m 10.244.0.3 node3
    kube-system etcd-node3 1/1 Running 0 86m 192.168.217.23 node3
    kube-system kube-apiserver-node3 1/1 Running 0 86m 192.168.217.23 node3
    kube-system kube-controller-manager-node3 1/1 Running 0 15m 192.168.217.23 node3
    kube-system kube-flannel-ds-thjml 1/1 Running 9 80m 192.168.217.23 node3
    kube-system kube-proxy-j6j8c 1/1 Running 0 6m57s 192.168.217.23 node3
    kube-system kube-scheduler-node3 1/1 Running 0 11m 192.168.217.23 node3
    kube-system storage-provisioner 1/1 Running 0 86m 192.168.217.23 node3

    The flannel pod's log:

    [root@node3 manifests]# kubectl logs kube-flannel-ds-thjml -n kube-system
    ... (snipped) ...
    I1102 12:25:02.177604 1 iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 -j ACCEPT
    I1102 12:25:02.178853 1 iptables.go:167] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully
    I1102 12:25:02.276100 1 iptables.go:155] Adding iptables rule: -d 10.244.0.0/16 -j ACCEPT
    I1102 12:25:02.276598 1 iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
    I1102 12:25:02.375648 1 iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
    I1102 12:25:02.379296 1 iptables.go:155] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/24 -j RETURN
    I1102 12:25:02.476040 1 iptables.go:155] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully

    flannel is healthy, the flannel.1 interface is present, and the subnet file was generated automatically:

    [root@node3 manifests]# ls /run/flannel/subnet.env
    /run/flannel/subnet.env
    [root@node3 manifests]# ip a
    1: lo: mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
    valid_lft forever preferred_lft forever
    2: ens33: mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 00:0c:29:70:12:12 brd ff:ff:ff:ff:ff:ff
    inet 192.168.217.23/24 brd 192.168.217.255 scope global ens33
    valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fe70:1212/64 scope link
    valid_lft forever preferred_lft forever
    3: docker0: mtu 1500 qdisc noqueue state DOWN
    link/ether 02:42:59:55:e5:7f brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
    valid_lft forever preferred_lft forever
    inet6 fe80::42:59ff:fe55:e57f/64 scope link
    valid_lft forever preferred_lft forever
    12: flannel.1: mtu 1450 qdisc noqueue state UNKNOWN
    link/ether 2e:b4:f1:da:9b:d9 brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.0/32 brd 10.244.0.0 scope global flannel.1
    valid_lft forever preferred_lft forever
    inet6 fe80::2cb4:f1ff:feda:9bd9/64 scope link
    valid_lft forever preferred_lft forever

    Problem solved!

    16. coredns is Running but never becomes ready

    Background:

    The coredns pod shows Running but stays 0/1 ready and does not work:

    [root@node3 ~]# kubectl get po -A -owide
    NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    kube-system coredns-7ff77c879f-55z6k 0/1 Running 0 5m49s 10.244.0.4 node3

    The pod's log:

    [INFO] plugin/ready: Still waiting on: "kubernetes"
    E1102 15:04:04.718632 1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
    E1102 15:04:05.719791 1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
    E1102 15:04:06.721905 1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
    E1102 15:04:07.723040 1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
    E1102 15:04:08.724991 1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host

    The key part: dial tcp 10.96.0.1:443: connect: no route to host

    Analysis:

    Install telnet and try the IP and port:

    [root@node3 ~]# telnet 10.96.0.1 443
    Trying 10.96.0.1...
    Connected to 10.96.0.1.
    Escape character is '^]'.

    So the port answers, meaning it is open and reachable over TCP, which makes the error all the more puzzling.

    [root@node3 ~]# curl -k https://10.96.0.1:443
    curl: (7) Failed connect to 10.96.0.1:443; Connection refused

    curl, by contrast, hangs for tens of seconds before failing with the error above, which smells strongly of a firewall problem.

    Fix:

    Stop the firewall and curl again; the endpoint now responds. (The 403 is expected because no client certificate was presented: the connection itself is fine, only authorization is missing.)

    [root@node3 ~]# systemctl stop firewalld
    [root@node3 ~]# curl -k https://10.96.0.1:443
    {
      "kind": "Status",
      "apiVersion": "v1",
      "metadata": {
      },
      "status": "Failure",
      "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
      "reason": "Forbidden",
      "details": {
      },
      "code": 403
    }
    Check the pod again; it is now ready:

    [root@node3 ~]# kubectl get po -A -owide
    NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    kube-system coredns-7ff77c879f-55z6k 1/1 Running 0 21m 10.244.0.4 node3

    The pod's log also looks normal now:

    I1102 15:13:35.813427 1 trace.go:116] Trace[1852186258]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105 (started: 2022-11-02 15:13:08.790440794 +0000 UTC m=+301.345800939) (total time: 27.022935709s):
    Trace[1852186258]: [27.022876523s] [27.022876523s] Objects listed
    I1102 15:13:35.813725 1 trace.go:116] Trace[1616138287]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105 (started: 2022-11-02 15:13:24.782953268 +0000 UTC m=+317.338313414) (total time: 11.030727079s):
    Trace[1616138287]: [11.030681851s] [11.030681851s] Objects listed

    And a functional test of coredns succeeds:

    [root@node3 ~]# kubectl run -it --image busybox:1.28.3 dns-test --restart=Never --rm
    If you don't see a command prompt, try pressing enter.
    / # nslookup kubernetes
    Server: 10.96.0.10
    Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
    Name: kubernetes
    Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local

    Fixed!

    17. A DaemonSet manifest fails validation

    Background:

    A DaemonSet deployment cannot be applied; it always fails with:

    root@k8s-master:~# kubectl apply -f 4.yaml
    error: error validating "4.yaml": error validating data: [ValidationError(DaemonSet.status): missing required field "currentNumberScheduled" in io.k8s.api.apps.v1.DaemonSetStatus, ValidationError(DaemonSet.status): missing required field "numberMisscheduled" in io.k8s.api.apps.v1.DaemonSetStatus, ValidationError(DaemonSet.status): missing required field "desiredNumberScheduled" in io.k8s.api.apps.v1.DaemonSetStatus, ValidationError(DaemonSet.status): missing required field "numberReady" in io.k8s.api.apps.v1.DaemonSetStatus]; if you choose to ignore these errors, turn validation off with --validate=false

    Analysis:

    The message is fairly explicit: DaemonSet.status is missing required fields. Opening the manifest shows a status: {} stanza at the end. The full file:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      creationTimestamp: null
      labels:
        app: nginx
      name: nginx
      namespace: project-tiger
    spec:
      selector:
        matchLabels:
          app: nginx
      #strategy: {}
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - image: httpd:2.4-alpine
            name: nginx
            resources: {}
    status: {}

    Fix:

    Per the error, either add the listed status fields (currentNumberScheduled, numberMisscheduled, desiredNumberScheduled, numberReady and so on) or simply delete status: {}.

    Since the goal is just to deploy, there is no need for status fields at all; deleting status: {} is the obvious choice.

    A small aside:

    This DaemonSet was adapted from a generated Deployment template, and status: {} comes along for the ride in such generated manifests. It serves no purpose here (much like an appendix), so just remove it.
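
    When generating templates like this, a client-side dry run gives you the YAML to edit; the command below is an illustration, not the exact one used for this file. Change kind to DaemonSet afterwards and drop the Deployment-only fields:

    kubectl create deployment nginx --image=httpd:2.4-alpine -n project-tiger --dry-run=client -o yaml > ds.yaml
    # edit ds.yaml: set kind: DaemonSet, remove replicas/strategy/status, keep the selector and template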

    18. kubelet: failed to get cgroup stats for docker.service

    The kubelet service keeps logging errors; the following was captured from the system log:

    Jan 20 13:44:18 k8s-master kubelet[1210]: E0120 13:44:18.882767 1210 summary_sys_containers.go:47] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/docker.service\": failed to get container info for \"/system.slice/docker.service\": unknown container \"/system.slice/docker.service\"" containerName="/system.slice/docker.service"
    Jan 20 13:44:26 k8s-master kubelet[1210]: E0120 13:44:26.799795 1210 summary_sys_containers.go:82] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/docker.service\": failed to get container info for \"/system.slice/docker.service\": unknown container \"/system.slice/docker.service\"" containerName="/system.slice/docker.service"
    Jan 20 13:44:28 k8s-master kubelet[1210]: E0120 13:44:28.906733 1210 summary_sys_containers.go:47] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/docker.service\": failed to get container info for \"/system.slice/docker.service\": unknown container \"/system.slice/docker.service\"" containerName="/system.slice/docker.service"

    Analysis:

    When kubelet starts it collects node resource statistics, which requires the corresponding accounting options to be enabled in systemd.

    Fix:

    Edit 10-kubeadm.conf, the kubelet systemd drop-in, and add these two lines:

    CPUAccounting=true
    MemoryAccounting=true

    The complete file then looks like this:

    # Note: This dropin only works with kubeadm and kubelet v1.11+
    [Service]
    CPUAccounting=true
    MemoryAccounting=true
    Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
    Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
    # This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
    EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
    # This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
    # the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
    EnvironmentFile=-/etc/default/kubelet
    ExecStart=
    ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS

    Reload systemd and restart kubelet, and the error messages stop:

    systemctl daemon-reload && systemctl restart kubelet

    Checking the system log again, everything is back to normal:

    Jan 20 13:45:35 k8s-master kubelet[62687]: I0120 13:45:35.638874 62687 reconciler.go:225] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-proxy\" (UniqueName: \"kubernetes.io/configmap/6decd6cc-a931-46b9-92fe-b3b1a03f9ea4-kube-proxy\") pod \"kube-proxy-5nj6l\" (UID: \"6decd6cc-a931-46b9-92fe-b3b1a03f9ea4\") "
    Jan 20 13:45:35 k8s-master kubelet[62687]: I0120 13:45:35.638918 62687 reconciler.go:225] "operationExecutor.VerifyControllerAttachedVolume started for volume \"host-local-net-dir\" (UniqueName: \"kubernetes.io/host-path/5ef5e743-ee71-4c80-a543-76e18a232a45-host-local-net-dir\") pod \"calico-node-4l4ll\" (UID: \"5ef5e743-ee71-4c80-a543-76e18a232a45\") "
    Jan 20 13:45:35 k8s-master kubelet[62687]: I0120 13:45:35.638951 62687 reconciler.go:157] "Reconciler: start to sync state"
    Jan 20 13:45:36 k8s-master kubelet[62687]: I0120 13:45:36.601380 62687 request.go:665] Waited for 1.151321922s due to client-side throttling, not priority and fair
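
    Whether the accounting switches took effect can be confirmed directly from systemd:

    systemctl show kubelet -p CPUAccounting -p MemoryAccounting
    # expected output:
    # CPUAccounting=yes
    # MemoryAccounting=yes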

    19. NFS storage provisioner: mount fails on the nodes

    Installing the NFS storage provisioner fails with:

    Mounting arguments: -t nfs 192.168.123.11:/data/nfs-sc /var/lib/kubelet/pods/4a0ead87-4932-4a9a-9fc0-2b89aac94b1a/volumes/kubernetes.io~nfs/nfs-client-root
    Output: mount: wrong fs type, bad option, bad superblock on 192.168.123.11:/data/nfs-sc,
    missing codepage or helper program, or other error
    (for several filesystems (e.g. nfs, cifs) you might
    need a /sbin/mount.<type> helper program)
    In some cases useful info is found in syslog - try
    dmesg | tail or so.

    Analysis:

    The NFS mount helper is missing on the node.

    Fix:

    yum install nfs-utils -y

    There are three nodes and only the master had nfs-utils installed; it has to be installed on the other nodes as well.
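
    A quick way to cover every node at once (the host names here are assumptions; substitute your own):

    for h in k8s-master k8s-node1 k8s-node2; do
      ssh "$h" "yum install -y nfs-utils"
    done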

  • Original post: https://blog.csdn.net/alwaysbefine/article/details/126531104