Pods fail with a "CreateContainerError" and "executable file not found in $PATH" appears in the pod events in OCP 4
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4
Issue
- Pods are unable to start and stay in a "CreateContainerError" status:
$ oc get pods
NAME READY STATUS RESTARTS AGE
kube-controller-manager-master1.example.com 3/4 CreateContainerError 0 18h
kube-controller-manager-master2.example.com 4/4 Running 0 12m
kube-controller-manager-master3.example.com 4/4 Running 0 18h
- oc describe shows kubelet errors reporting that the executable was not found in $PATH:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Failed 59m (x8706 over 17h) kubelet (combined from similar events): Error: container create failed: time="2021-04-11T21:03:07Z" level=error msg="container_linux.go:366: starting container process caused: exec: \"cluster-kube-scheduler-operator\": executable file not found in $PATH"
Normal Pulled 4m10s (x4600 over 17h) kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6468c1dd1ca2d855e171dda54efcb56b8915ba65f9b915899d922c8720d8e7e1" already present on machine
- This issue can affect one or more images.
- The errors normally affect a specific node.
- Deleting and re-downloading the image does not resolve the issue.
- Running the affected image with podman on the node returns a different error:
$ podman run 2810ace6e1fe
readlink /var/lib/containers/storage/overlay: invalid argument
Resolution
This issue is tracked in Red Hat Bugzilla 1950536.
The workaround is to delete all images from the /var/lib/containers/storage
directory and reboot the node. The steps are:
- Drain the node with the problematic images:
$ oc adm drain master1.example.com --ignore-daemonsets --delete-local-data --force --grace-period=1
node/master1.example.com cordoned
WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-859bg, openshift-controller-manager/controller-manager-2bvrq, openshift-dns/dns-default-d995f, openshift-image-registry/node-ca-xrw5r, openshift-machine-config-operator/machine-config-daemon-dxj98, openshift-machine-config-operator/machine-config-server-q7gpv, openshift-monitoring/node-exporter-jzxvt, openshift-multus/multus-74zhp, openshift-multus/multus-admission-controller-xqj2r, openshift-multus/network-metrics-daemon-vrst2, openshift-sdn/ovs-9vvlq, openshift-sdn/sdn-controller-m6kz9, openshift-sdn/sdn-psnlt
evicting pod openshift-image-registry/cluster-image-registry-operator-548576fb5b-frmfp
evicting pod openshift-apiserver-operator/openshift-apiserver-operator-67fd49986d-9tdmf
evicting pod openshift-apiserver/apiserver-7f54fbf8f6-psv55
evicting pod openshift-authentication-operator/authentication-operator-74c6b567fb-bx5h6
...
pod/apiserver-64f575f4f6-cr99f evicted
node/ocp46ipi-t46gj-master-0 evicted
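If more than one node shows the symptom, the drain step can be scripted. The sketch below only prints the commands (a dry run) so the node list and flags can be reviewed before anything is executed; the helper name and node list are placeholders, not part of the original procedure:

```shell
# Dry-run sketch: print the drain command for each affected node instead
# of executing it, so the list can be reviewed first.
print_drain_cmds() {
  for node in "$@"; do
    echo oc adm drain "$node" --ignore-daemonsets --delete-local-data --force --grace-period=1
  done
}

# Review the output, then run the commands once they look right:
print_drain_cmds master1.example.com
```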
- SSH to the node, disable the crio and kubelet services, and reboot:
$ systemctl disable crio; systemctl disable kubelet; reboot
- Once the node has restarted, SSH to it again, delete the storage overlay directories, and then enable and start the crio and kubelet services. As the root user, execute:
$ rm -rf /var/lib/containers/storage/*
$ systemctl enable crio; systemctl enable kubelet
Created symlink /etc/systemd/system/multi-user.target.wants/crio.service → /usr/lib/systemd/system/crio.service.
Created symlink /etc/systemd/system/multi-user.target.wants/kubelet.service → /etc/systemd/system/kubelet.service.
$ systemctl start crio; systemctl start kubelet
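If the wipe above is put into a script, it is worth guarding the path so the glob can never expand against an empty variable. This is an optional hardening sketch layered on the commands above; wipe_storage is a hypothetical helper name:

```shell
# wipe_storage: remove everything under the given storage root.
# The ${1:?...} expansion makes the function abort with an error when no
# path is passed, so "rm -rf" can never accidentally see a bare /* .
wipe_storage() {
  rm -rf "${1:?storage path required}"/*
}

# On the affected node, as root, after the services are disabled:
# wipe_storage /var/lib/containers/storage
# systemctl enable crio kubelet
# systemctl start crio kubelet
```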
- Wait a few minutes and check that the containers are running again:
$ crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
96afb20435c62 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5f3da24f9a2383afa1cf31d707cdcd03df0e21084523d17373f74d03349700ff 15 seconds ago Running sdn-controller 0 1862d340c8fe8
984dfa6a1f4f2 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5f3da24f9a2383afa1cf31d707cdcd03df0e21084523d17373f74d03349700ff 15 seconds ago Running openvswitch 0 ba8fb32dcdf30
5fb941a0a4c4d quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5f3da24f9a2383afa1cf31d707cdcd03df0e21084523d17373f74d03349700ff 15 seconds ago Running sdn 0 2c6bc4f7d59b9
9b2b03880dd6c quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9b58995c876bcb431e0f1d54d611a8b8e9cb7a60744a9df0a9193786d8865020 22 seconds ago Running machine-config-server 0 454b048591d4c
4cd8d971f9a6c quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9b58995c876bcb431e0f1d54d611a8b8e9cb7a60744a9df0a9193786d8865020 22 seconds ago Running machine-config-daemon 0 490c2493f6b0e
d128e50e478be quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c6b99fa7f1114aac1818c48eab061f13d7c0d02d70f2308c36b03a4dcda20282 27 seconds ago Running kube-rbac-proxy 0 2d95fa31b9536
- Uncordon the node so it can receive workloads again:
$ oc adm uncordon master1.example.com
Root Cause
This issue can be caused by an ungraceful power-off of a node while images are being pulled from a registry, leaving one or more images corrupted.
Diagnostic Steps
Running podman run with the problematic image returns a different error:
$ podman run 2810ace6e1fe
readlink /var/lib/containers/storage/overlay: invalid argument
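The readlink failure points at the overlay store itself, whose l/ directory normally holds short-name symlinks into the layer diff directories. A corrupted pull can leave those links dangling. The helper below is an illustrative assumption (both the function name and the layout detail), not a supported check; the fix remains the full wipe described above:

```shell
# Sketch: list dangling symlinks directly under an overlay link directory.
# find's -xtype l matches symlinks whose target does not resolve.
scan_overlay_links() {
  find "${1:?link dir required}" -maxdepth 1 -xtype l
}

# On the node: scan_overlay_links /var/lib/containers/storage/overlay/l
```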
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.