根据以上信息可以看出,在 UTC 0322 20:55 左右,出现了连接 API SERVER 超时的情况,接下来查看 API SERVER 服务日志。
kube-apiserver
E0321 22:47:41.289207 1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, context canceled] E0321 22:47:48.503439 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"} I0321 22:47:48.503648 1 trace.go:116] Trace[1655495025]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-scheduler,user-agent:kube-scheduler/v1.18.10 (linux/amd64) kubernetes/62876fc/leader-election,client:127.0.0.1 (started: 2021-03-21 22:47:38.517639233 +0000 UTC m=+1182359.771988386) (total time: 9.985984157s): . . . Trace[723022686]: [21.463074932s] [21.463070777s] About to write a response W0321 22:52:07.981746 1 lease.go:224] Resetting endpoints for master service "kubernetes" to [10.151.130.137 10.151.130.138 10.151.130.139] E0321 22:54:34.697599 1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: leader changed"}
根据以上信息,可以确定因为 ETCD leader changed 导致 controller manager and scheduler unavailable。
ETCD 检查
在 API SERVER 日志中查看到了 etcdserver: leader changed 的提示,接下来检查 ETCD 服务日志来具体分析。
ETCD
2021-03-21 22:47:38.503291 W | rafthttp: lost the TCP streaming connection with peer 496ad58099e4ad83 (stream Message reader) 2021-03-21 22:47:38.503313 E | rafthttp: failed to read 496ad58099e4ad83 on stream Message (read tcp 10.151.130.138:51268->10.151.130.139:2380: i/o timeout) 2021-03-21 22:47:38.503317 I | rafthttp: peer 496ad58099e4ad83 became inactive (message send to peer failed) 2021-03-21 22:47:38.503628 W | rafthttp: lost the TCP streaming connection with peer ee17b2c8bf87d608 (stream Message reader) 2021-03-21 22:47:38.503641 E | rafthttp: failed to read ee17b2c8bf87d608 on stream Message (read tcp 10.151.130.138:34276->10.151.130.137:2380: i/o timeout) 2021-03-21 22:47:38.503645 I | rafthttp: peer ee17b2c8bf87d608 became inactive (message send to peer failed) 2021-03-21 22:47:38.517019 W | rafthttp: lost the TCP streaming connection with peer ee17b2c8bf87d608 (stream MsgApp v2 reader) 2021-03-21 22:47:38.521398 W | rafthttp: lost the TCP streaming connection with peer 496ad58099e4ad83 (stream MsgApp v2 reader) 2021-03-21 22:47:38.531432 I | rafthttp: peer ee17b2c8bf87d608 became active 2021-03-21 22:47:38.531455 I | rafthttp: established a TCP streaming connection with peer ee17b2c8bf87d608 (stream MsgApp v2 reader) 2021-03-21 22:47:38.531461 I | rafthttp: peer 496ad58099e4ad83 became active 2021-03-21 22:47:38.531474 I | rafthttp: established a TCP streaming connection with peer 496ad58099e4ad83 (stream MsgApp v2 reader) 2021-03-21 22:47:38.531599 I | rafthttp: established a TCP streaming connection with peer 496ad58099e4ad83 (stream Message reader) 2021-03-21 22:47:38.533011 I | rafthttp: established a TCP streaming connection with peer ee17b2c8bf87d608 (stream Message reader) 2021-03-21 22:47:38.631853 E | rafthttp: failed to write ee17b2c8bf87d608 on stream Message (write tcp 10.151.130.138:2380->10.151.130.137:36700: write: connection reset by peer) 2021-03-21 22:47:38.631869 I | rafthttp: peer ee17b2c8bf87d608 became inactive (message send to peer failed) 2021-03-21 22:47:38.632307 W | rafthttp: lost the TCP streaming connection with peer ee17b2c8bf87d608 (stream Message writer) 2021-03-21 22:47:38.634600 I | rafthttp: peer ee17b2c8bf87d608 became active 2021-03-21 22:47:40.880842 W | rafthttp: closed an existing TCP streaming connection with peer 496ad58099e4ad83 (stream Message writer) 2021-03-21 22:47:40.880863 I | rafthttp: established a TCP streaming connection with peer 496ad58099e4ad83 (stream Message writer) 2021-03-21 22:47:40.881069 W | rafthttp: closed an existing TCP streaming connection with peer ee17b2c8bf87d608 (stream MsgApp v2 writer) 2021-03-21 22:47:40.881086 I | rafthttp: established a TCP streaming connection with peer ee17b2c8bf87d608 (stream MsgApp v2 writer) 2021-03-21 22:47:40.882339 W | rafthttp: closed an existing TCP streaming connection with peer 496ad58099e4ad83 (stream MsgApp v2 writer) 2021-03-21 22:47:40.882354 I | rafthttp: established a TCP streaming connection with peer 496ad58099e4ad83 (stream MsgApp v2 writer) 2021-03-21 22:47:41.425089 I | rafthttp: established a TCP streaming connection with peer ee17b2c8bf87d608 (stream Message writer)
通过 ETCD 日志可以确定,在 UTC 2021-03-21 22:47:38 出现了 lost the TCP streaming connection with peer 。
E0322 20:48:29.475790 1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"} E0322 20:48:29.475844 1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
E0322 20:52:34.274857 1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: leader changed"}
ETCD
2021-03-22 20:30:46.088174 I | mvcc: finished scheduled compaction at 23035178 (took 26.164333ms) 2021-03-22 20:34:52.330573 W | etcdserver: read-only range request "key:\"/registry/persistentvolumeclaims/default/nginx-logs-pvc\" " with result "range_response_count:1 size:1405" took too long (1.669619791s) to execute 2021-03-22 20:34:52.332649 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:615" took too long (1.471652409s) to execute 2021-03-22 20:34:52.333036 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " with result "range_response_count:1 size:588" took too long (217.627474ms) to execute 2021-03-22 20:34:52.333058 W | etcdserver: read-only range request "key:\"/registry/resourcequotas/\" range_end:\"/registry/resourcequotas0\" count_only:true " with result "range_response_count:0 size:9" took too long (630.275951ms) to execute 2021-03-22 20:34:52.333081 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/local-path-storage/rancher.io-local-path\" " with result "range_response_count:1 size:653" took too long (394.73969ms) to execute 2021-03-22 20:34:52.333308 W | etcdserver: read-only range request "key:\"/registry/services/specs/default/kubernetes\" " with result "range_response_count:1 size:616" took too long (430.624112ms) to execute 2021-03-22 20:35:46.067308 I | mvcc: store.index: compact 23036390 2021-03-22 20:35:46.093276 I | mvcc: finished scheduled compaction at 23036390 (took 25.599134ms) 2021-03-22 20:40:46.077355 I | mvcc: store.index: compact 23037604 2021-03-22 20:40:46.104071 I | mvcc: finished scheduled compaction at 23037604 (took 26.305942ms) 2021-03-22 20:43:58.958177 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:615" took too long (702.184455ms) to execute 2021-03-22 20:43:58.959252 W | etcdserver: read-only range request "key:\"/registry/leases/kube-system/kube-scheduler\" " with result "range_response_count:1 size:491" took too long (687.254ms) to execute 2021-03-22 20:43:58.959275 W | etcdserver: read-only range request "key:\"/registry/leases/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:518" took too long (651.871463ms) to execute 2021-03-22 20:43:58.959363 W | etcdserver: read-only range request "key:\"/registry/events/\" range_end:\"/registry/events0\" count_only:true " with result "range_response_count:0 size:9" took too long (647.299563ms) to execute 2021-03-22 20:45:46.088764 I | mvcc: store.index: compact 23038829 2021-03-22 20:45:46.117371 I | mvcc: finished scheduled compaction at 23038829 (took 28.090255ms) 2021-03-22 20:48:13.545676 W | rafthttp: lost the TCP streaming connection with peer 4bd420be9a532df8 (stream Message reader) 2021-03-22 20:48:13.545703 E | rafthttp: failed to read 4bd420be9a532df8 on stream Message (read tcp 10.151.130.137:54864->10.151.130.138:2380: i/o timeout) 2021-03-22 20:48:13.545707 I | rafthttp: peer 4bd420be9a532df8 became inactive (message send to peer failed) 2021-03-22 20:48:13.549798 W | rafthttp: lost the TCP streaming connection with peer 4bd420be9a532df8 (stream MsgApp v2 reader) raft2021/03/22 20:48:14 INFO: ee17b2c8bf87d608 [term: 164] received a MsgVote message with higher term from 496ad58099e4ad83 [term: 165] raft2021/03/22 20:48:14 INFO: ee17b2c8bf87d608 became follower at term 165 raft2021/03/22 20:48:14 INFO: ee17b2c8bf87d608 [logterm: 164, index: 29128056, vote: 0] cast MsgVote for 496ad58099e4ad83 [logterm: 164, index: 29128056] at term 165 raft2021/03/22 20:48:14 INFO: raft.node: ee17b2c8bf87d608 lost leader 4bd420be9a532df8 at term 165 raft2021/03/22 20:48:14 INFO: raft.node: ee17b2c8bf87d608 elected leader 496ad58099e4ad83 at term 165 2021-03-22 20:48:14.475506 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "error:etcdserver: leader changed" took too long (5.299921652s) to execute 2021-03-22 20:48:14.477440 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/local-path-storage/rancher.io-local-path\" " with result "range_response_count:1 size:653" took too long (4.456789691s) to execute 2021-03-22 20:48:14.477462 W | etcdserver: read-only range request "key:\"/registry/storageclasses/\" range_end:\"/registry/storageclasses0\" count_only:true " with result "range_response_count:0 size:9" took too long (3.623175886s) to execute 2021-03-22 20:48:14.477539 W | etcdserver: read-only range request "key:\"/registry/networkpolicies/\" range_end:\"/registry/networkpolicies0\" count_only:true " with result "range_response_count:0 size:7" took too long (478.668022ms) to execute 2021-03-22 20:48:14.477622 W | etcdserver: read-only range request "key:\"/registry/rolebindings/\" range_end:\"/registry/rolebindings0\" count_only:true " with result "range_response_count:0 size:9" took too long (1.244980828s) to execute 2021-03-22 20:48:14.477732 W | etcdserver: read-only range request "key:\"/registry/apiregistration.k8s.io/apiservices/\" range_end:\"/registry/apiregistration.k8s.io/apiservices0\" count_only:true " with result "range_response_count:0 size:9" took too long (2.397168285s) to execute 2021-03-22 20:48:16.618038 I | rafthttp: peer 4bd420be9a532df8 became active 2021-03-22 20:48:16.618059 I | rafthttp: established a TCP streaming connection with peer 4bd420be9a532df8 (stream MsgApp v2 reader) 2021-03-22 20:48:16.624716 W | rafthttp: closed an existing TCP streaming connection with peer 4bd420be9a532df8 (stream MsgApp v2 writer) 2021-03-22 20:48:16.624736 I | rafthttp: established a TCP streaming connection with peer 4bd420be9a532df8 (stream MsgApp v2 writer) 2021-03-22 20:48:16.624878 I | rafthttp: established a TCP streaming connection with peer 4bd420be9a532df8 (stream Message reader) 2021-03-22 20:48:16.627175 W | rafthttp: closed an existing TCP streaming connection with peer 4bd420be9a532df8 (stream Message writer) 2021-03-22 20:48:16.627188 I | rafthttp: established a TCP streaming connection with peer 4bd420be9a532df8 (stream Message writer) WARNING: 2021/03/22 20:48:23 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing" 2021-03-22 20:50:46.088074 I | mvcc: store.index: compact 23040043
处理方法
通过商议,决定取消 ETCD 虚拟机的定时快照。经过一天运行,ETCD 运行日志中没有再出现 lost the TCP streaming connection with peer 。
2021-03-23 19:58:04.190676 I | etcdserver: saved snapshot at index 29600607 2021-03-23 19:58:04.190832 I | etcdserver: compacted raft log at 29595607 2021-03-23 19:58:33.388856 I | pkg/fileutil: purged file /var/lib/rancher/etcd/member/snap/00000000000000a7-0000000001bfdaca.snap successfully 2021-03-23 20:00:48.065975 I | mvcc: store.index: compact 23379302 2021-03-23 20:00:48.091487 I | mvcc: finished scheduled compaction at 23379302 (took 25.110115ms) 2021-03-23 20:02:39.378543 W | etcdserver: read-only range request "key:\"/registry/configmaps/ingress-nginx/ingress-controller-leader-nginx\" " with result "range_response_count:1 size:607" took too long (117.791832ms) to execute 2021-03-23 20:05:48.073024 I | mvcc: store.index: compact 23380519 2021-03-23 20:05:48.100228 I | mvcc: finished scheduled compaction at 23380519 (took 26.693346ms) 2021-03-23 20:10:48.079058 I | mvcc: store.index: compact 23381735 2021-03-23 20:10:48.106893 I | mvcc: finished scheduled compaction at 23381735 (took 27.437585ms) 2021-03-23 20:15:30.155491 I | wal: segmented wal file /var/lib/rancher/etcd/member/wal/000000000000015a-0000000001c3c1fc.wal is created 2021-03-23 20:15:33.414322 I | pkg/fileutil: purged file /var/lib/rancher/etcd/member/wal/0000000000000155-0000000001bcc554.wal successfully 2021-03-23 20:15:48.095615 I | mvcc: store.index: compact 23382954 2021-03-23 20:15:48.122662 I | mvcc: finished scheduled compaction at 23382954 (took 26.68459ms) 2021-03-23 20:20:48.092179 I | mvcc: store.index: compact 23384166 2021-03-23 20:20:48.118647 I | mvcc: finished scheduled compaction at 23384166 (took 26.050565ms) 2021-03-23 20:25:48.097468 I | mvcc: store.index: compact 23385389 2021-03-23 20:25:48.125829 I | mvcc: finished scheduled compaction at 23385389 (took 28.006118ms) 2021-03-23 20:30:48.103474 I | mvcc: store.index: compact 23386608 2021-03-23 20:30:48.131849 I | mvcc: finished scheduled compaction at 23386608 (took 27.984451ms) 2021-03-23 20:35:48.118190 I | mvcc: store.index: compact 23387819 2021-03-23 20:35:48.144807 I | mvcc: finished scheduled compaction at 23387819 (took 26.220415ms) 2021-03-23 20:40:48.124117 I | mvcc: store.index: compact 23389036 2021-03-23 20:40:48.150120 I | mvcc: finished scheduled compaction at 23389036 (took 25.643958ms) 2021-03-23 20:45:48.130181 I | mvcc: store.index: compact 23390248 2021-03-23 20:45:48.156873 I | mvcc: finished scheduled compaction at 23390248 (took 26.303108ms) 2021-03-23 20:50:48.136542 I | mvcc: store.index: compact 23391473 2021-03-23 20:50:48.168088 I | mvcc: finished scheduled compaction at 23391473 (took 31.146475ms) 2021-03-23 20:55:48.141884 I | mvcc: store.index: compact 23392692 2021-03-23 20:55:48.168249 I | mvcc: finished scheduled compaction at 23392692 (took 25.983679ms) 2021-03-23 21:00:48.148486 I | mvcc: store.index: compact 23393921 2021-03-23 21:00:48.175167 I | mvcc: finished scheduled compaction at 23393921 (took 26.323769ms) 2021-03-23 21:05:48.153553 I | mvcc: store.index: compact 23395139 2021-03-23 21:05:48.179843 I | mvcc: finished scheduled compaction at 23395139 (took 25.890462ms) 2021-03-23 21:10:48.159706 I | mvcc: store.index: compact 23396363 2021-03-23 21:10:48.186493 I | mvcc: finished scheduled compaction at 23396363 (took 26.422307ms) 2021-03-23 21:15:48.165114 I | mvcc: store.index: compact 23397590 2021-03-23 21:15:48.191193 I | mvcc: finished scheduled compaction at 23397590 (took 25.715209ms) 2021-03-23 21:20:48.171112 I | mvcc: store.index: compact 23398807 2021-03-23 21:20:48.198116 I | mvcc: finished scheduled compaction at 23398807 (took 26.516055ms) 2021-03-23 21:25:48.176894 I | mvcc: store.index: compact 23400025 2021-03-23 21:25:48.205417 I | mvcc: finished scheduled compaction at 23400025 (took 28.06852ms) 2021-03-23 21:30:48.182368 I | mvcc: store.index: compact 23401248 2021-03-23 21:30:48.208162 I | mvcc: finished scheduled compaction at 23401248 (took 25.359328ms) 2021-03-23 21:35:48.188427 I | mvcc: store.index: compact 23402496 2021-03-23 21:35:48.214977 I | mvcc: finished scheduled compaction at 23402496 (took 26.130896ms) 2021-03-23 21:40:48.203757 I | mvcc: store.index: compact 23403714 2021-03-23 21:40:48.230346 I | mvcc: finished scheduled compaction at 23403714 (took 26.234451ms) 2021-03-23 21:45:48.203085 I | mvcc: store.index: compact 23404944 2021-03-23 21:45:48.232110 I | mvcc: finished scheduled compaction at 23404944 (took 28.66739ms) 2021-03-23 21:50:48.211722 I | mvcc: store.index: compact 23406161 2021-03-23 21:50:48.238941 I | mvcc: finished scheduled compaction at 23406161 (took 26.80059ms) 2021-03-23 21:55:48.221035 I | mvcc: store.index: compact 23407381 2021-03-23 21:55:48.248858 I | mvcc: finished scheduled compaction at 23407381 (took 27.422759ms) 2021-03-23 22:00:48.229789 I | mvcc: store.index: compact 23408597 2021-03-23 22:00:48.257402 I | mvcc: finished scheduled compaction at 23408597 (took 27.233652ms) 2021-03-23 22:05:48.243738 I | mvcc: store.index: compact 23409819 2021-03-23 22:05:48.269861 I | mvcc: finished scheduled compaction at 23409819 (took 25.728754ms) 2021-03-23 22:10:48.249241 I | mvcc: store.index: compact 23411044 2021-03-23 22:10:48.275749 I | mvcc: finished scheduled compaction at 23411044 (took 26.150918ms) 2021-03-23 22:15:48.254890 I | mvcc: store.index: compact 23412261 2021-03-23 22:15:48.281103 I | mvcc: finished scheduled compaction at 23412261 (took 25.838699ms) 2021-03-23 22:20:13.800090 W | etcdserver: read-only range request "key:\"/registry/ranges/serviceips\" " with result "range_response_count:1 size:8223" took too long (208.151939ms) to execute 2021-03-23 22:20:48.264931 I | mvcc: store.index: compact 23413481 2021-03-23 22:20:48.290955 I | mvcc: finished scheduled compaction at 23413481 (took 25.550081ms) 2021-03-23 22:25:38.543370 I | etcdserver: start to snapshot (applied: 29650608, lastsnap: 29600607) 2021-03-23 22:25:38.547640 I | etcdserver: saved snapshot at index 29650608 2021-03-23 22:25:38.547810 I | etcdserver: compacted raft log at 29645608 2021-03-23 22:25:48.270855 I | mvcc: store.index: compact 23414700 2021-03-23 22:25:48.298973 I | mvcc: finished scheduled compaction at 23414700 (took 27.738717ms) 2021-03-23 22:26:03.434776 I | pkg/fileutil: purged file /var/lib/rancher/etcd/member/snap/00000000000000a7-0000000001c09e1b.snap successfully 2021-03-23 22:30:48.275888 I | mvcc: store.index: compact 23415924 2021-03-23 22:30:48.305413 I | mvcc: finished scheduled compaction at 23415924 (took 29.128337ms) 2021-03-23 22:35:48.280997 I | mvcc: store.index: compact 23417141 2021-03-23 22:35:48.307847 I | mvcc: finished scheduled compaction at 23417141 (took 26.42452ms) 2021-03-23 22:40:48.287720 I | mvcc: store.index: compact 23418356 2021-03-23 22:40:48.317291 I | mvcc: finished scheduled compaction at 23418356 (took 29.183932ms) 2021-03-23 22:45:48.295599 I | mvcc: store.index: compact 23419585 2021-03-23 22:45:48.322346 I | mvcc: finished scheduled compaction at 23419585 (took 26.370195ms) 2021-03-23 22:50:48.301478 I | mvcc: store.index: compact 23420817 2021-03-23 22:50:48.327555 I | mvcc: finished scheduled compaction at 23420817 (took 25.656891ms) 2021-03-23 22:55:48.307901 I | mvcc: store.index: compact 23422034 2021-03-23 22:55:48.337474 I | mvcc: finished scheduled compaction at 23422034 (took 29.204046ms) 2021-03-23 23:00:48.314532 I | mvcc: store.index: compact 23423249 2021-03-23 23:00:48.340979 I | mvcc: finished scheduled compaction at 23423249 (took 26.109985ms) 2021-03-23 23:05:48.319811 I | mvcc: store.index: compact 23424469 2021-03-23 23:05:48.346462 I | mvcc: finished scheduled compaction at 23424469 (took 26.255207ms) 2021-03-23 23:10:48.326553 I | mvcc: store.index: compact 23425683 2021-03-23 23:10:48.353562 I | mvcc: finished scheduled compaction at 23425683 (took 26.632973ms) 2021-03-23 23:15:48.333217 I | mvcc: store.index: compact 23426896 2021-03-23 23:15:48.359319 I | mvcc: finished scheduled compaction at 23426896 (took 25.72745ms) 2021-03-23 23:20:48.339558 I | mvcc: store.index: compact 23428115 2021-03-23 23:20:48.366617 I | mvcc: finished scheduled compaction at 23428115 (took 26.658721ms) 2021-03-23 23:25:48.345722 I | mvcc: store.index: compact 23429337 2021-03-23 23:25:48.373084 I | mvcc: finished scheduled compaction at 23429337 (took 27.017076ms) 2021-03-23 23:30:48.351550 I | mvcc: store.index: compact 23430561 2021-03-23 23:30:48.378308 I | mvcc: finished scheduled compaction at 23430561 (took 26.3218ms) 2021-03-23 23:35:48.356635 I | mvcc: store.index: compact 23431805 2021-03-23 23:35:48.382235 I | mvcc: finished scheduled compaction at 23431805 (took 25.195416ms) 2021-03-23 23:40:48.362534 I | mvcc: store.index: compact 23433019 2021-03-23 23:40:48.388744 I | mvcc: finished scheduled compaction at 23433019 (took 25.639283ms) 2021-03-23 23:45:48.368555 I | mvcc: store.index: compact 23434235 2021-03-23 23:45:48.395256 I | mvcc: finished scheduled compaction at 23434235 (took 26.290222ms) 2021-03-23 23:50:48.373567 I | mvcc: store.index: compact 23435452 2021-03-23 23:50:48.405420 I | mvcc: finished scheduled compaction at 23435452 (took 31.411081ms) 2021-03-23 23:55:48.379145 I | mvcc: store.index: compact 23436674 2021-03-23 23:55:48.405956 I | mvcc: finished scheduled compaction at 23436674 (took 26.306194ms) 2021-03-24 00:00:48.384095 I | mvcc: store.index: compact 23437902