RKE2: Maximum failure threshold exceeded for plan with checksum

Permalink: https://www.xtplayer.cn/rke2/maximum-failure-threshold-exceeded-for-plan-with-checksum/

Problem description

In the Rancher UI, a master node is shown in an error state with the message:

Error applying plan -- check rancher-system-agent.service logs on node for more information

However, running kubectl get node shows the node in a normal state.

The rancher-system-agent log on the node contains the following error:

rancher-system-agent[3419230]: time="2026-01-06T15:09:38+08:00" level=error msg="[K8s] Maximum failure threshold exceeded for plan with checksum value of b53110a4c92ea7e89cc08a8a77d7683604b10ed61567edc90ed48e3274ba2cc6, (failures: 1, threshold: 1)"
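To see which plan is failing and how many attempts were made, the checksum and counters can be pulled out of log lines like the one above. The following is a minimal sketch, not part of the original procedure. The function name parse_plan_failure is hypothetical, and the pattern assumes the message format matches the sample line exactly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads rancher-system-agent log lines on stdin and prints
# the plan checksum, failure count, and threshold for every matching line.
# Assumes the message format shown in the sample log line above.
parse_plan_failure() {
  sed -n -E 's/.*checksum value of ([0-9a-f]+), \(failures: ([0-9]+), threshold: ([0-9]+)\).*/checksum=\1 failures=\2 threshold=\3/p'
}
```

On the affected node it could be fed from the journal, for example: journalctl -u rancher-system-agent --no-pager | parse_plan_failure | sort -u. Lines that do not match the pattern are dropped without output.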

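Impact

The Rancher UI shows the node in an error state
The node may be unable to receive and execute deployment plans
Cluster management functions may be limited
Core Kubernetes functionality may nevertheless keep working normally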

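Root cause analysis

This error means that rancher-system-agent exceeded the maximum failure threshold while trying to apply a configuration plan. In the log above the threshold is 1, so a single failed attempt (failures: 1) is enough to trigger it. Possible causes include:

Plan execution failure: the configuration plan sent by Rancher cannot be executed successfully on the node
Network communication problems: communication between the node and the Rancher server is abnormal
Resource conflicts: resources in the configuration plan conflict with existing resources
Permission problems: rancher-system-agent lacks the permissions needed for the operation
System component failure: a related system service is in an abnormal state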

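Solution

Option 1: Force a resync of the node configuration

# Delete the cattle-cluster-agent pod so that it is recreated.
# Note: after the pod is deleted, the downstream cluster will briefly appear disconnected in the Rancher UI.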
kubectl delete pod -n cattle-system -l app=cattle-cluster-agent

# On the node, restart the rancher-system-agent service
sudo systemctl daemon-reload
sudo systemctl restart rancher-system-agent
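
After the restart, it helps to confirm that the error no longer shows up in recent log output. The sketch below is one possible check, not a step from the original procedure. The function name plan_error_cleared is hypothetical, and it only looks for the message text from the log line quoted earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads log text on stdin and succeeds (exit status 0)
# only if no "Maximum failure threshold exceeded" line is present.
plan_error_cleared() {
  ! grep -q 'Maximum failure threshold exceeded'
}
```

A possible use on the node, a few minutes after the restart: journalctl -u rancher-system-agent --since "5 minutes ago" --no-pager | plan_error_cleared && echo "no plan failure logged". Empty input also counts as cleared, so the check is meaningful only once the agent has had time to write new log lines.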

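Option 2: Create a snapshot backup of the downstream cluster

If Option 1 has no effect, try creating a snapshot backup of the downstream cluster from the Rancher UI. Taking the backup triggers a forced update of the cluster's node configuration.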