OCP 4.x: Nodes Stuck in Reboot Loop After Upgrade
Issue
- After upgrading an OpenShift Container Platform (OCP) cluster, some nodes (both control plane and worker nodes) fail to stay online and enter a crash/reboot loop. The following symptoms are observed:
- Kernel panic immediately before the node reboots, visible in the node's console log or in journald:
kernel: zstd_compress: no symbol version for module_layout
kernel: ------------[ cut here ]------------
kernel: kernel BUG at arch/x86/kernel/alternative.c:288!
- Portworx storage pods (and related containers such as portworx-api and the CSI driver) are in CrashLoopBackOff. The kubelet may log errors about Portworx containers failing to start or about readiness probes failing, for example: “Error syncing pod ... failed container=portworx-api ... CrashLoopBackOff” and “Failed to load PX filesystem dependencies for kernel…”. A diagnostic sketch follows this list.
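The commands below are a minimal diagnostic sketch for confirming these symptoms from a workstation with cluster-admin access; the node name and the Portworx pod-name patterns are placeholders, and the journal check assumes a persistent journal so that the previous boot (-b -1) is still available.
$ NODE=worker-0.example.com                                                                  # hypothetical node name
$ oc debug node/${NODE} -- chroot /host journalctl -k -b -1 | grep -i 'alternative.c:288'    # kernel messages from the previous boot
$ oc get pods -A -o wide | grep -iE 'portworx|px-' | grep -i CrashLoopBackOff                # Portworx pods stuck in CrashLoopBackOff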
Environment
- Red Hat OpenShift Container Platform 4.16.47 to 4.16.50
- Red Hat OpenShift Container Platform 4.17.40 to 4.17.42
- Red Hat OpenShift Container Platform 4.18.24 to 4.18.26
- Out-of-tree (O) kernel modules loaded on the affected nodes (at least one of the following; a sketch for listing them follows this list):
Oracle [oracleoks]
IBM [mmfs26]
IBM [tracedev]
HPE [ice]
HPE [numatools]
Portworx [px]
eTrust [SEOS]
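As a hedged example, one way to check whether a node is running any out-of-tree modules is shown below; the node name is a placeholder, and the check relies on /proc/modules marking out-of-tree modules with the O taint flag (for example (O) or (OE)).
$ NODE=worker-0.example.com                                                                  # hypothetical node name
$ oc debug node/${NODE} -- chroot /host cat /proc/modules | grep -E '\([A-Z]*O[A-Z]*\)'      # modules carrying the O (out-of-tree) taint flag
$ oc debug node/${NODE} -- chroot /host cat /proc/sys/kernel/tainted                         # non-zero output means the running kernel is tainted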