Permalink to this article: https://www.xtplayer.cn/rancher/clusterunavailable-503-cluster-not-found/

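As shown in the screenshot, after upgrading to Rancher v2.5.16 you may find that clicking a cluster no longer opens its home page. The Rancher UI hangs, and then a ClusterUnavailable 503 cluster not found error appears in the upper-right corner of the page.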

The Rancher pod logs show the following errors:

{"log":"2022/10/25 12:31:48 [ERROR] failed on subscribe storageClass: ClusterUnavailable 503: ClusterUnavailable 503: cluster not found\r\n","stream":"stdout","time":"2022-10-25T12:31:48.660186963Z"}
{"log":"2022/10/25 12:31:48 [ERROR] failed on subscribe apiService: ClusterUnavailable 503: ClusterUnavailable 503: cluster not found\r\n","stream":"stdout","time":"2022-10-25T12:31:48.660193393Z"}
{"log":"2022/10/25 12:31:48 [ERROR] failed on subscribe namespace: ClusterUnavailable 503: ClusterUnavailable 503: cluster not found\r\n","stream":"stdout","time":"2022-10-25T12:31:48.660195401Z"}
{"log":"2022/10/25 12:31:48 [ERROR] failed on subscribe persistentVolume: ClusterUnavailable 503: ClusterUnavailable 503: cluster not found\r\n","stream":"stdout","time":"2022-10-25T12:31:48.660197333Z"}
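When the log is long, it can help to list which resource types are failing to subscribe. The following is a minimal sketch, assuming the log lines are in the JSON-per-line container log format shown above; the function name `failed_subscriptions` and the regular expression are illustrative and not part of Rancher.

```python
import json
import re

# Matches the message part of lines such as:
#   [ERROR] failed on subscribe storageClass: ClusterUnavailable 503: ...
SUBSCRIBE_ERROR = re.compile(
    r"\[ERROR\] failed on subscribe (\w+): ClusterUnavailable 503"
)


def failed_subscriptions(lines):
    """Return the resource types whose subscription failed, in first-seen order."""
    seen = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)["log"]
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # not a JSON-formatted container log line
        match = SUBSCRIBE_ERROR.search(message)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Two shortened lines in the same shape as the excerpt above.
sample = [
    r'{"log":"2022/10/25 12:31:48 [ERROR] failed on subscribe storageClass: ClusterUnavailable 503: ClusterUnavailable 503: cluster not found\r\n","stream":"stdout"}',
    r'{"log":"2022/10/25 12:31:48 [ERROR] failed on subscribe apiService: ClusterUnavailable 503: ClusterUnavailable 503: cluster not found\r\n","stream":"stdout"}',
]
print(failed_subscriptions(sample))  # ['storageClass', 'apiService']
```

If every subscription for a cluster fails in the same second, as in the excerpt, the problem is more likely with the cluster connection as a whole than with any single resource type.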

The following error also appears frequently in the Rancher pod logs:

[ERROR] [secretmigrator] failed to migrate service account token secret for cluster c-qlmv9, will retry: Operation cannot be fulfilled on clusters.management.cattle.io "c-qlmv9": the object has been modified; please apply your changes to the latest version and try again
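The "object has been modified" text is the standard Kubernetes conflict error: an update was submitted with a stale resourceVersion because something else wrote the object first. The sketch below illustrates that mechanism and the usual re-read-and-retry response with an in-memory stand-in. `FakeStore`, `ConflictError`, and `update_with_retry` are hypothetical names for this illustration, not Rancher or Kubernetes APIs, and the sketch does not explain why Rancher's own retries keep failing in this case.

```python
class ConflictError(Exception):
    """Raised when an update carries a stale resource version."""


class FakeStore:
    """In-memory stand-in for an API server holding a single object."""

    def __init__(self, obj):
        self._obj = dict(obj)
        self._version = 1

    def get(self):
        return dict(self._obj), self._version

    def update(self, obj, version):
        if version != self._version:
            raise ConflictError("the object has been modified")
        self._obj = dict(obj)
        self._version += 1


def update_with_retry(store, mutate, attempts=5):
    """Re-read the object and re-apply mutate() each time the write conflicts."""
    for _ in range(attempts):
        obj, version = store.get()
        mutate(obj)
        try:
            store.update(obj, version)
            return True
        except ConflictError:
            continue
    return False


store = FakeStore({"name": "c-qlmv9"})
calls = {"n": 0}


def mutate(obj):
    calls["n"] += 1
    if calls["n"] == 1:
        # Simulate another writer changing the object between our read and write.
        other, other_version = store.get()
        other["touched"] = True
        store.update(other, other_version)
    obj["migrated"] = True


ok = update_with_retry(store, mutate)
print(ok, calls["n"])  # True 2
```

The first attempt fails because of the simulated concurrent write, and the second succeeds against the fresh copy. An error that repeats indefinitely, as in the log above, suggests that the retry never gets to write against an up-to-date copy.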

Analysis

v2.5.16 includes a fix for a CVE (serviceAccountToken stored in plaintext). The fix extracts the contents of the serviceAccountToken field from the clusters.management.cattle.io resource and creates a secret from them. It then replaces the serviceAccountToken field in the clusters.management.cattle.io resource with a serviceAccountTokenSecret field whose value refers to the newly created secret.
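The migration described above can be sketched on plain dictionaries. This is only an illustration of the idea, not Rancher's implementation. It assumes that both fields live under `status`; the namespace, the secret naming scheme, and the `credential` key are hypothetical choices made for the example.

```python
import copy


def migrate_token(cluster, secrets, namespace="cattle-global-data"):
    """Move a plaintext token from the cluster object into a secret.

    Returns a new cluster dict; `secrets` (name -> secret dict) gains one entry.
    """
    cluster = copy.deepcopy(cluster)
    status = cluster.setdefault("status", {})
    token = status.get("serviceAccountToken")
    if not token:
        return cluster  # nothing to migrate

    # Hypothetical naming scheme, chosen only for this illustration.
    secret_name = "cluster-serviceaccounttoken-" + cluster["metadata"]["name"]
    secrets[secret_name] = {
        "metadata": {"name": secret_name, "namespace": namespace},
        "stringData": {"credential": token},
    }

    # Replace the plaintext field with a reference to the secret.
    status["serviceAccountTokenSecret"] = secret_name
    del status["serviceAccountToken"]
    return cluster


secrets = {}
before = {
    "metadata": {"name": "c-qlmv9"},
    "status": {"serviceAccountToken": "example-token"},
}
after = migrate_token(before, secrets)
print(after["status"])
# {'serviceAccountTokenSecret': 'cluster-serviceaccounttoken-c-qlmv9'}
```

The last step is the one that matters for this issue: if the updated cluster object cannot be written back, the plaintext field stays in place and the migration is attempted again.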

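For now, this fix logic appears to conflict with certain fields in certain versions, so the serviceAccountToken field in the clusters.management.cattle.io resource cannot be replaced with serviceAccountTokenSecret as intended.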
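One way to see which clusters are affected is to look for objects that still carry the old field. The following is a rough sketch, assuming the input is the JSON produced by `kubectl get clusters.management.cattle.io -o json` on the Rancher local cluster and that the field sits under `status`; the function name `unmigrated_clusters` is hypothetical.

```python
import json


def unmigrated_clusters(cluster_list_json):
    """Return names of clusters that still carry a plaintext serviceAccountToken."""
    items = json.loads(cluster_list_json).get("items", [])
    names = []
    for item in items:
        status = item.get("status") or {}
        # Assumption: a migrated cluster no longer has this field set.
        if status.get("serviceAccountToken"):
            names.append(item["metadata"]["name"])
    return names


# Hand-written example input in the shape of a Kubernetes list response.
example = json.dumps({
    "items": [
        {"metadata": {"name": "c-qlmv9"},
         "status": {"serviceAccountToken": "example-token"}},
        {"metadata": {"name": "c-done"},
         "status": {"serviceAccountTokenSecret": "some-secret"}},
    ]
})
print(unmigrated_clusters(example))  # ['c-qlmv9']
```

A cluster listed here that also shows up in the secretmigrator errors is a likely candidate for the manual procedure.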

Solution

See issue https://github.com/rancher/rancher/issues/38699. You can use the script https://github.com/rancherlabs/support-tools/blob/master/rotate-tokens/rotate-tokens.sh to perform the update manually.