Pods fail with "CreateContainerError" error and "executable file not found in $PATH" is found in pod's log in OCP 4

Solution Verified - Updated -

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4

Issue

  • Pods are unable to start and stay in the "CreateContainerError" status:
$ oc get pods
NAME                                          READY  STATUS                RESTARTS  AGE
kube-controller-manager-master1.example.com   3/4    CreateContainerError  0         18h
kube-controller-manager-master2.example.com   4/4    Running               0         12m
kube-controller-manager-master3.example.com   4/4    Running               0         18h
  • oc describe pod shows kubelet events reporting that the container's executable was not found in $PATH:
Events:
  Type     Reason  Age                     From     Message
  ----     ------  ----                    ----     -------
  Warning  Failed  59m (x8706 over 17h)    kubelet  (combined from similar events): Error: container create failed: time="2021-04-11T21:03:07Z" level=error msg="container_linux.go:366: starting container process caused: exec: \"cluster-kube-scheduler-operator\": executable file not found in $PATH"
  Normal   Pulled  4m10s (x4600 over 17h)  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6468c1dd1ca2d855e171dda54efcb56b8915ba65f9b915899d922c8720d8e7e1" already present on machine
  • The issue can affect one or more images.
  • The errors are usually confined to a specific node.
  • Deleting and re-pulling the image does not resolve the issue.
  • Running the affected image with podman on the node returns a different error:
$ podman run 2810ace6e1fe
readlink /var/lib/containers/storage/overlay: invalid argument"
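
To check whether the failures are concentrated on a single node, the affected pods can be tallied per node. The following is a minimal sketch and not part of the original solution: the function name count_create_errors_by_node and the four-column input layout (node, namespace, pod name, container waiting reasons) are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count pods that have a container in
# CreateContainerError, grouped by node.
# Input (stdin): whitespace-separated lines of NODE NAMESPACE NAME REASONS.
count_create_errors_by_node() {
  awk '$4 ~ /CreateContainerError/ { n[$1]++ }
       END { for (node in n) print node, n[node] }' | sort
}

# Example invocation against a live cluster (requires a logged-in oc session):
#   oc get pods -A --no-headers \
#     -o custom-columns='NODE:.spec.nodeName,NS:.metadata.namespace,NAME:.metadata.name,REASONS:.status.containerStatuses[*].state.waiting.reason' \
#     | count_create_errors_by_node
```

If a single node accounts for all of the hits, that node is the likely candidate for the workaround described in the Resolution section.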

Resolution

This issue is being tracked in Red Hat Bugzilla 1950536.

The workaround is to delete everything under /var/lib/containers/storage on the affected node, so that all images are pulled again, and to restart the container services. The steps are:

  • Drain the node with the problematic images:
$ oc adm drain master1.example.com --ignore-daemonsets --delete-local-data --force --grace-period=1
node/master1.example.com cordoned
WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-859bg, openshift-controller-manager/controller-manager-2bvrq, openshift-dns/dns-default-d995f, openshift-image-registry/node-ca-xrw5r, openshift-machine-config-operator/machine-config-daemon-dxj98, openshift-machine-config-operator/machine-config-server-q7gpv, openshift-monitoring/node-exporter-jzxvt, openshift-multus/multus-74zhp, openshift-multus/multus-admission-controller-xqj2r, openshift-multus/network-metrics-daemon-vrst2, openshift-sdn/ovs-9vvlq, openshift-sdn/sdn-controller-m6kz9, openshift-sdn/sdn-psnlt
evicting pod openshift-image-registry/cluster-image-registry-operator-548576fb5b-frmfp
evicting pod openshift-apiserver-operator/openshift-apiserver-operator-67fd49986d-9tdmf
evicting pod openshift-apiserver/apiserver-7f54fbf8f6-psv55
evicting pod openshift-authentication-operator/authentication-operator-74c6b567fb-bx5h6
...
pod/apiserver-64f575f4f6-cr99f evicted
node/ocp46ipi-t46gj-master-0 evicted
  • SSH to the node, disable the crio and kubelet services, and reboot:
$ systemctl disable crio; systemctl disable kubelet; reboot
  • Once the node has restarted, SSH to it again and delete the contents of the container storage directory, then enable and start the crio and kubelet services. As the root user, execute:
$ rm -rf /var/lib/containers/storage/*
$ systemctl enable crio; systemctl enable kubelet
Created symlink /etc/systemd/system/multi-user.target.wants/crio.service → /usr/lib/systemd/system/crio.service.
Created symlink /etc/systemd/system/multi-user.target.wants/kubelet.service → /etc/systemd/system/kubelet.service.
$ systemctl start crio; systemctl start kubelet
  • Wait a few minutes and check that the containers are running again:
$ crictl ps
CONTAINER           IMAGE                                                                                                                    CREATED             STATE               NAME                    ATTEMPT             POD ID
96afb20435c62       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5f3da24f9a2383afa1cf31d707cdcd03df0e21084523d17373f74d03349700ff   15 seconds ago      Running             sdn-controller          0                   1862d340c8fe8
984dfa6a1f4f2       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5f3da24f9a2383afa1cf31d707cdcd03df0e21084523d17373f74d03349700ff   15 seconds ago      Running             openvswitch             0                   ba8fb32dcdf30
5fb941a0a4c4d       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5f3da24f9a2383afa1cf31d707cdcd03df0e21084523d17373f74d03349700ff   15 seconds ago      Running             sdn                     0                   2c6bc4f7d59b9
9b2b03880dd6c       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9b58995c876bcb431e0f1d54d611a8b8e9cb7a60744a9df0a9193786d8865020   22 seconds ago      Running             machine-config-server   0                   454b048591d4c
4cd8d971f9a6c       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9b58995c876bcb431e0f1d54d611a8b8e9cb7a60744a9df0a9193786d8865020   22 seconds ago      Running             machine-config-daemon   0                   490c2493f6b0e
d128e50e478be       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c6b99fa7f1114aac1818c48eab061f13d7c0d02d70f2308c36b03a4dcda20282   27 seconds ago      Running             kube-rbac-proxy         0                   2d95fa31b9536
  • Uncordon the node.
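
The article does not show a command for the final step. A minimal sketch follows, assuming the example node name master1.example.com used above; the run and finish_node_recovery helpers and the DRY_RUN variable are hypothetical names introduced here. By default the script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical wrapper around the final step of the workaround.
# NODE defaults to the example node name used in this article.
NODE="${NODE:-master1.example.com}"

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

finish_node_recovery() {
  # Make the node schedulable again (counterpart of "oc adm drain").
  run oc adm uncordon "$NODE"
  # Confirm that the node reports Ready and is no longer SchedulingDisabled.
  run oc get node "$NODE"
  # List the pods scheduled on the node to confirm they start correctly.
  run oc get pods -A -o wide --field-selector "spec.nodeName=$NODE"
}
```

After sourcing the file, calling finish_node_recovery prints the three commands; setting DRY_RUN=0 before the call runs them against the cluster of the current oc session.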

Root Cause

This issue can be caused by an ungraceful power-off of nodes while images are being pulled from a registry, which leaves one or more images corrupted.

Diagnostic Steps

Running the problematic image with podman run on the affected node gives a different error from the one reported by the kubelet:

$ podman run 2810ace6e1fe
readlink /var/lib/containers/storage/overlay: invalid argument"
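
As an additional, unofficial check, the overlay storage can be inspected for dangling symbolic links. This is only a heuristic sketch: it assumes the usual overlay layout, in which /var/lib/containers/storage/overlay/l holds short symlinks pointing at layer directories, and it assumes that the corruption shows up as links whose targets are missing. Neither assumption is confirmed by this article, and the function name list_dangling_links is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list symlinks whose target no longer exists.
# The default directory is an assumption about the overlay storage layout.
list_dangling_links() {
  dir="${1:-/var/lib/containers/storage/overlay/l}"
  # "test -e" follows symlinks, so it fails for links with a missing target.
  find "$dir" -type l ! -exec test -e {} \; -print 2>/dev/null
}
```

Running list_dangling_links as root on the affected node prints one path per broken link. An empty result does not rule out corruption, since damaged layer contents would not be detected this way.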

