Permalink: https://www.xtplayer.cn/rancher/waiting-for-node-to-register-either-cluster-is-not-ready-for-registering-or-etcd-and-controlplane-node-have-to-be-registered-first/

INFO: Environment: CATTLE_ADDRESS=10.1xx.xx.xx CATTLE_AGENT_CONNECT=true CATTLE_CA_CHECKSUM=99e6ccda7c91855xxxxxxxxx4f760c0278713b95b30ab0616b66df1a CATTLE_CLUSTER=false CATTLE_INTERNAL_ADDRESS= CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cncxxxx060vl CATTLE_SERVER=https://rancher.xxxx.com
INFO: Using resolv.conf: nameserver 10.1xx.xx.xx nameserver 10.1xx.xx.xx search lnx.fxxxxx fmcxxxxx.cn
INFO: https://rancher.xxxx.com/ping is accessible
INFO: rancher.xxxx.com resolves to 10.1xx.xx.xx
INFO: Value from https://rancher.xxxx.com/v3/settings/cacerts is an x509 certificate
time="2022-09-17T06:10:47Z" level=info msg="Rancher agent version v2.4.8 is starting"
time="2022-09-17T06:10:47Z" level=info msg="Listening on /tmp/log.sock"
time="2022-09-17T06:10:47Z" level=info msg="Option customConfig=map[address:10.1xx.xx.xx internalAddress: label:map[] roles:[] taints:[]]"
time="2022-09-17T06:10:47Z" level=info msg="Option etcd=false"
time="2022-09-17T06:10:47Z" level=info msg="Option controlPlane=false"
time="2022-09-17T06:10:47Z" level=info msg="Option worker=false"
time="2022-09-17T06:10:47Z" level=info msg="Option requestedHostname=cncxxxx060vl"
time="2022-09-17T06:10:47Z" level=info msg="Connecting to wss://rancher.xxxx.com/v3/connect with token ks5rgcxxxxxxxpkb7nd2zj4qsk6snclcxqnn"
time="2022-09-17T06:10:47Z" level=info msg="Connecting to proxy" url="wss://rancher.xxxx.com/v3/connect"
time="2022-09-17T06:10:47Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2022-09-17T06:10:49Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2022-09-17T06:10:51Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2022-09-17T06:10:53Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2022-09-17T06:10:55Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2022-09-17T06:10:57Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2022-09-17T06:10:59Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2022-09-17T06:11:01Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2022-09-17T06:11:03Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"

As the log above shows, in a Rancher custom cluster the node agent pod sometimes logs Waiting for node to register. When this message appears, the node has not registered with Rancher successfully.
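To confirm that a given agent pod is stuck in this state rather than having logged the message once during startup, it can help to count how often the message appears in recent log output. The following is a minimal sketch: `count_waiting` is a hypothetical helper name, and the `cattle-system` namespace and the pod name placeholder in the usage comment are assumptions that may not match your environment.

```shell
#!/bin/sh
# count_waiting: read agent log text on stdin and print how many lines
# contain the "Waiting for node to register" message.
# grep -c exits non-zero when the count is 0, so "|| true" normalizes that.
count_waiting() {
    grep -c 'Waiting for node to register' || true
}

# Hypothetical usage (namespace and pod name are assumptions):
#   kubectl -n cattle-system logs <node-agent-pod> --tail=200 | count_waiting
```

A count that keeps growing between runs suggests the node is still waiting to register.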

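Because the node is not registered with Rancher, the base components on it cannot be upgraded when the custom cluster version is upgraded later. The node is still registered with the underlying Kubernetes cluster, however, so creating pods, deleting pods, and other Kubernetes operations are unaffected.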

Resolution

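The quickest fix is to delete the node, reinitialize it, and add it back to the cluster. In a production environment that is running workloads, deleting the node may not be an option, in which case the node has to be repaired manually as follows.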

  1. Run the following command to see how cluster IDs, node IDs, and node IPs map to each other. Using the IP of the node that hosts the pod logging Waiting for node to register, find the corresponding cluster ID and node ID.

    kubectl get nodes.management.cattle.io -A \
    -o=custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,IP:.spec.customConfig.address,HostnameOverride:.spec.requestedHostname
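  2. Then pick one healthy node and one abnormal node in the same cluster, and run the following command for each to print its configuration as YAML.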

    kubectl get nodes.management.cattle.io  -n c-xxx m-xxxx -oyaml
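  3. At the end of the YAML, the healthy node has an rkeNode section and the abnormal node does not. The fix is to add the rkeNode section to the abnormal node's configuration by hand.

    Using the healthy node's rkeNode section as a reference, change address, hostnameOverride, nodeName, and role. If the nodes were added with different parameters, their rkeNode sections will differ as well.

    The following is an example rkeNode section from a healthy node: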

    rkeNode:
      address: 192.168.1.224
      hostnameOverride: alihost01
      nodeName: c-9p9ck:m-4cb5bfd0709c # cluster_id and node_id
      port: "22"
      role:
      - etcd
      - controlplane
      - worker
      user: root
  4. Run the following command to edit the node resource.

    kubectl edit nodes.management.cattle.io -n c-xxx m-xxxx
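Steps 2 to 4 involve comparing two YAML dumps and hand-writing an rkeNode section, which is easy to get wrong. The sketch below shows one way to check a saved dump for the section and to generate a candidate block to paste in. The helper names `has_rke_node` and `render_rke_node` are hypothetical, and the `port: "22"` and `user: root` values are copied from the example above as assumptions; take the real values, and the indentation level at which the block belongs, from the healthy node in your own cluster.

```shell
#!/bin/sh
# has_rke_node FILE: succeed if the saved node YAML contains an rkeNode key.
has_rke_node() {
    grep -Eq '^[[:space:]]*rkeNode:' "$1"
}

# render_rke_node ADDRESS HOSTNAME CLUSTER_ID NODE_ID ROLE...
# Print an rkeNode block shaped like the example above.
# port and user are assumptions taken from that example.
render_rke_node() {
    address=$1; hostname=$2; cluster=$3; node=$4
    shift 4
    printf 'rkeNode:\n'
    printf '  address: %s\n' "$address"
    printf '  hostnameOverride: %s\n' "$hostname"
    printf '  nodeName: %s:%s\n' "$cluster" "$node"
    printf '  port: "22"\n'
    printf '  role:\n'
    for r in "$@"; do
        printf '  - %s\n' "$r"
    done
    printf '  user: root\n'
}

# Hypothetical usage (IDs are the placeholders used in this article):
#   kubectl get nodes.management.cattle.io -n c-xxx m-xxxx -o yaml > node-backup.yaml
#   has_rke_node node-backup.yaml || render_rke_node 10.0.0.5 myhost c-xxx m-xxxx worker
```

Saving the dump to a file before running kubectl edit also leaves a backup of the original resource in case the edit needs to be reverted.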