Permalink to this article: https://www.xtplayer.cn/rke2/kubelet-error-dial-unix-var-run-eni-eni-socket-connect-no-such-file-or-directory/

Deploy terway by following the terway documentation at https://github.com/AliyunContainerService/terway/blob/main/docs/terway-with-cilium.md . Once the deployment is finished, running tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log shows the following error:

E0514 10:59:01.271853    1314 kuberuntime_manager.go:705] "killPodWithSyncResult failed" err="failed to \"KillPodSandbox\" for \"ebba0c90-bf0d-400e-b90d-f46e0f86dad3\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\"648f025e0f6f670c97fbb66ad7917f171b048d842dfadf3feb6af5ac2ad11a5e\\\": plugin type=\\\"terway\\\" failed (delete): failed to do del; error get ip from terway, pod kube-system/rke2-coredns-rke2-coredns-775b8cc74f-4phm8, rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing: dial unix var/run/eni/eni.socket: connect: no such file or directory\\\"\""

The error says that the socket file of the terway driver cannot be found, yet the file does exist on the host at /var/run/eni/eni.socket.
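A quick way to compare the two paths is to test each one directly. This is a minimal sketch: describe_socket is a hypothetical helper name, not part of terway or rke2, and the result of the relative check depends on the directory the script is run from.

```shell
#!/bin/sh
# describe_socket is a hypothetical helper: it reports whether a path
# exists and whether it is a Unix domain socket.
describe_socket() {
    if [ -S "$1" ]; then
        echo "socket"
    elif [ -e "$1" ]; then
        echo "exists but is not a socket"
    else
        echo "missing"
    fi
}

# Absolute path on the host, as checked in the article.
describe_socket /var/run/eni/eni.socket
# Relative path exactly as it appears in the error message;
# this one is resolved against the current working directory.
describe_socket var/run/eni/eni.socket
```

On an affected node the first call would be expected to report a socket, while the second only succeeds when it is run from a directory that contains var/run/eni.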

Problem analysis

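When containerd executes a CNI plugin, it runs the plugin as a sandboxed child process (fork-exec). Judging from the error, the path that terway reports when opening the socket connection is a relative one: var/run/eni/eni.socket. A relative path is resolved against the working directory of the process, so run the following command to check the cwd of the kubelet under rke2: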

ls -l /proc/$(pidof kubelet)/cwd
lrwxrwxrwx 1 root root 0 May 19 18:25 /proc/2765834/cwd -> /var/lib/rancher/rke2/server
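To make the consequence of that cwd concrete, the sketch below joins the relative path from the error onto the directory reported above. It only manipulates strings; resolve_against_cwd is a hypothetical helper, and the sketch assumes the CNI plugin inherits the same working directory that the kubelet reports.

```shell
#!/bin/sh
# resolve_against_cwd is a hypothetical helper that mimics how a relative
# path is resolved: absolute paths are kept, relative ones are joined
# onto the given working directory.
resolve_against_cwd() {
    cwd=$1
    rel=$2
    case $rel in
        /*) printf '%s\n' "$rel" ;;                 # absolute: cwd is ignored
        *)  printf '%s/%s\n' "${cwd%/}" "$rel" ;;   # relative: join onto cwd
    esac
}

# cwd taken from the ls output above, path taken from the error message.
resolve_against_cwd /var/lib/rancher/rke2/server var/run/eni/eni.socket
```

This prints /var/lib/rancher/rke2/server/var/run/eni/eni.socket, a path that does not exist on the host, which is consistent with the "no such file or directory" error. On a node, the first argument could instead be taken from readlink /proc/$(pidof kubelet)/cwd.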

Solution

Save the following content as a yaml file, for example terway-eniip-patch.yaml, and adjust the image to match the version you actually run.

spec:
  template:
    spec:
      initContainers:
        - command:
            - ln
            - '-sf'
            - /run/eni
            - /var/lib/rancher/rke2/server/var/run/eni
          image: registry-cn-hangzhou.ack.aliyuncs.com/acs/terway:v1.13.5
          imagePullPolicy: IfNotPresent
          name: init-eni-socket-dir
          volumeMounts:
            - mountPath: /var/lib/rancher/rke2/server/var/run
              name: rke2-dir
      volumes:
        - hostPath:
            path: /var/lib/rancher/rke2/server/var/run
            type: ''
          name: rke2-dir

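Then run the patch command to add the init container. This container symlinks the host's /run/eni to /var/lib/rancher/rke2/server/var/run/eni, so the relative path var/run/eni/eni.socket, resolved from the /var/lib/rancher/rke2/server working directory, reaches the real socket.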

kubectl -n kube-system patch daemonsets.apps terway-eniip --patch-file terway-eniip-patch.yaml
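The effect of the symlink can be tried out safely away from a real node. The sketch below rebuilds the relevant directory layout under a throwaway temporary directory (an assumption made for illustration; a plain file stands in for the real socket) and runs the same ln invocation as the init container.

```shell
#!/bin/sh
# Sandbox demo: every path lives under a temporary root (assumption),
# standing in for the real host paths.
set -e
root=$(mktemp -d)
mkdir -p "$root/run/eni" "$root/var/lib/rancher/rke2/server/var/run"
touch "$root/run/eni/eni.socket"   # plain file standing in for the socket

# Same ln arguments as the init container, relocated under $root.
ln -sf "$root/run/eni" "$root/var/lib/rancher/rke2/server/var/run/eni"

# Resolve the relative path from the cwd that the kubelet reported.
cd "$root/var/lib/rancher/rke2/server"
if [ -e var/run/eni/eni.socket ]; then
    result="reachable"
else
    result="missing"
fi
echo "var/run/eni/eni.socket is $result"

# Clean up the sandbox.
cd /
rm -rf "$root"
```

With the symlink in place the script reports the relative path as reachable; removing the ln line makes it report missing, which mirrors the state of the node before the patch.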