Error when upgrading the cluster: "ImagePullBackOff: error creating read-write layer with ID * no such file or directory"
Environment
- Red Hat OpenShift Container Platform 4.12 and earlier versions.
Issue
While upgrading a cluster, a node fails to pull images. Its Machine Config Daemon pod shows an error like the following:
error creating read-write layer with ID "4372a0c382584d7752da058c5267d1d652d727585457a71c5ef3d4d17a951719": Stat /var/lib/containers/storage/overlay/27abd31f77c1e21b8897140edb61d1e52d48a3cd287c03796dbecb54684871d3: no such file or directory
As a result, the upgrade is blocked.
The problem can occur on several nodes in succession.
Resolution
- The issue happens due to a problem in CRI-O image layer handling, which is documented in bug OCPBUGS-16874.
- Upgrade to OCP 4.13, or to OCP 4.12.45 or later, to avoid this issue. Note that if a node has already experienced this issue, you must still perform the steps below to wipe all CRI-O storage on it, regardless of whether the node was upgraded successfully; the storage must be wiped at least once to clear the condition.
- If a cluster is affected, the following workaround can be applied to solve the problem:
  - Drain the node:
    $ oc adm drain --ignore-daemonsets --delete-emptydir-data ${NODE}
  - On the affected node, run the following commands as root:
    $ systemctl disable kubelet
    $ systemctl disable crio
    $ reboot
  - After the reboot, run the following commands, also as root:
    $ rm -rf /var/lib/containers/*
    $ crio wipe -f
    $ systemctl enable --now crio
    $ systemctl enable --now kubelet
  - Uncordon the node:
    $ oc adm uncordon ${NODE}
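The cluster-side half of the workaround above (drain before the wipe, uncordon after it) can be wrapped in a small helper so that the exact commands can be reviewed before they run. This is only a sketch: the `DRY_RUN` variable and the `run`, `drain_node`, and `uncordon_node` function names are hypothetical and not part of `oc`. The node-side steps are left out on purpose, because they span a reboot and must be run as root on the node itself.

```shell
#!/bin/sh
# Sketch only: DRY_RUN, run, drain_node and uncordon_node are
# hypothetical names introduced for illustration.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Drain the node before wiping its container storage.
drain_node() {
    run oc adm drain --ignore-daemonsets --delete-emptydir-data "$1"
}

# Uncordon the node once crio and kubelet are running again.
uncordon_node() {
    run oc adm uncordon "$1"
}
```

Calling `drain_node "${NODE}"` with the default `DRY_RUN=1` only prints the drain command; setting `DRY_RUN=0` executes it.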
Root Cause
- An issue regarding how container storage handles layers was fixed and merged into CRI-O as of OCP 4.12.45.
- As CRI-O uses the shared containers/storage package, the fixes were made in that package and then imported into CRI-O as of OCP 4.12.45 or later:
  - https://github.com/containers/storage/pull/1138
  - https://github.com/containers/storage/pull/1407
- Prior to these versions, image layers could be mishandled, causing errors when reading or accessing container image layer directories.
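The version boundary described above can be expressed as a small check. This is a minimal sketch: the function name `has_fix` is hypothetical, it assumes a plain `X.Y.Z` (or `X.Y`) version string, and the boundary it encodes (4.13 and later, or 4.12.45 and later) is taken only from the versions named in this article.

```shell
#!/bin/sh
# Sketch only: has_fix is a hypothetical helper.
# Exit status 0 if VERSION is 4.13 or later, or 4.12.45 or later;
# exit status 1 otherwise.
has_fix() {
    ver="${1%%-*}"            # drop any suffix after a dash
    major="${ver%%.*}"
    rest="${ver#*.}"
    minor="${rest%%.*}"
    patch="${rest#*.}"
    # When there is no patch component, treat it as 0.
    if [ "$patch" = "$rest" ]; then
        patch=0
    fi
    patch="${patch%%.*}"

    if [ "$major" -gt 4 ]; then return 0; fi
    if [ "$major" -lt 4 ]; then return 1; fi
    if [ "$minor" -ge 13 ]; then return 0; fi
    if [ "$minor" -eq 12 ] && [ "$patch" -ge 45 ]; then return 0; fi
    return 1
}
```

For example, `has_fix 4.12.44` returns a non-zero status, while `has_fix 4.12.45` returns 0. A node that hit the error before the upgrade still needs its storage wiped once, whatever this check reports.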
Diagnostic Steps
If a cluster upgrade becomes blocked, check whether any pod whose name starts with machine-config-daemon is still in an unhealthy status, such as ContainerCreating, around 5 minutes after it was created. To do so, run the following command:
$ oc get pod -n openshift-machine-config-operator
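To narrow the listing down to the pods of interest, the output can be piped through a small filter. The sketch below assumes the default column layout of `oc get pod` (NAME, READY, STATUS, ...); the function name `unhealthy_mcd_pods` is hypothetical. It looks only at the STATUS column, so a pod that is Running but not fully ready is not reported.

```shell
#!/bin/sh
# Sketch only: unhealthy_mcd_pods is a hypothetical helper.
# Reads `oc get pod` output on stdin and prints the name and status of
# every machine-config-daemon pod whose STATUS column is not "Running".
unhealthy_mcd_pods() {
    awk '$1 ~ /^machine-config-daemon/ && $3 != "Running" { print $1, $3 }'
}
```

Usage: `oc get pod -n openshift-machine-config-operator | unhealthy_mcd_pods`.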
If a pod is in an unhealthy status, check its logs with the following command:
$ oc logs <pod_name> -n openshift-machine-config-operator
The cluster is likely affected by this bug if the error shown in the "Issue" section, repeated below, appears:
error creating read-write layer with ID "4372a0c382584d7752da058c5267d1d652d727585457a71c5ef3d4d17a951719": Stat /var/lib/containers/storage/overlay/27abd31f77c1e21b8897140edb61d1e52d48a3cd287c03796dbecb54684871d3: no such file or directory
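Because the layer IDs differ from node to node, it can help to match the error by its fixed parts rather than by the exact string. The following is a minimal sketch; the function name `has_layer_error` is hypothetical, and the pattern is derived only from the sample message above, so other wordings of the error would not match.

```shell
#!/bin/sh
# Sketch only: has_layer_error is a hypothetical helper.
# Reads log text on stdin; exit status 0 if a line matches the error
# signature shown above, with any hexadecimal layer ID.
has_layer_error() {
    grep -Eq 'error creating read-write layer with ID "[0-9a-f]+": .*no such file or directory'
}
```

Usage: `oc logs <pod_name> -n openshift-machine-config-operator | has_layer_error && echo "error signature found"`.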
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.