OCP 4.x: Nodes Stuck in Reboot Loop After Upgrade

Solution Verified

Issue

  • After upgrading an OpenShift Container Platform (OCP) cluster, some nodes (both control plane and workers) fail to stay online, entering a crash/reboot loop. The following symptoms are observed:
    • Kernel panic with kernel BUG at arch/x86/kernel/alternative.c:288!, which appears in the node's console log or in journald immediately before the node reboots:

      kernel: zstd_compress: no symbol version for module_layout
      kernel: ------------[ cut here ]------------
      kernel: kernel BUG at arch/x86/kernel/alternative.c:288!
    • Portworx storage pods (and related containers such as portworx-api and the CSI driver) are in CrashLoopBackOff. The kubelet may log errors about Portworx containers failing to start or readiness probes failing, for instance: "Error syncing pod ... failed container=portworx-api ... CrashLoopBackOff" and "Failed to load PX filesystem dependencies for kernel…".
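The panic signature above can be confirmed from the node's kernel journal for the boot that crashed. A minimal sketch, assuming `oc debug` access to the node; the node name below is a placeholder, and the filter function is a hypothetical helper, not part of any Red Hat tooling:

```shell
# panic_signature_filter: keep only journal lines matching the two symptoms
# described above (the alternative.c BUG and the missing-symbol warning).
panic_signature_filter() {
  grep -E 'kernel BUG at arch/x86/kernel/alternative\.c|no symbol version for module_layout'
}

# On a live cluster the filter would be fed from the node's previous-boot
# kernel journal, e.g. (node name is a placeholder):
#   oc debug node/worker-0.example.com -- chroot /host \
#     journalctl -k -b -1 --no-pager | panic_signature_filter
```

If the filter prints the BUG line for the previous boot, the node hit this panic rather than an unrelated reboot cause.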

Environment

  • Red Hat OpenShift Container Platform 4.16.47 to 4.16.50
  • Red Hat OpenShift Container Platform 4.17.40 to 4.17.42
  • Red Hat OpenShift Container Platform 4.18.24 to 4.18.26
  • Out-of-tree ('O'-tainted) kernel modules (at least one of the following):
    Oracle [oracleoks]
    IBM [mmfs26]
    IBM [tracedev]
    HPE [ice]
    HPE [numatools]
    Portworx [px]
    eTrust [SEOS]
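One way to check whether a node is running any out-of-tree modules like those listed above is to inspect each loaded module's taint flags under /sys/module: modules built outside the in-tree kernel carry the 'O' flag. A minimal sketch, assuming shell access to the node (for example via `oc debug node/<node> -- chroot /host`); the function name and base-directory parameter are illustrative:

```shell
# list_oot_modules BASE: print every module under BASE whose taint flags
# include 'O' (out-of-tree), along with its full flag string.
# On a node, call it as: list_oot_modules /sys/module
list_oot_modules() {
  base="$1"
  for m in "$base"/*; do
    t="$m/taint"
    # Not every module directory exposes a taint file; skip those that don't.
    if [ -f "$t" ] && grep -q 'O' "$t"; then
      echo "${m##*/}: $(cat "$t")"
    fi
  done
}
```

Any module it reports (e.g. px or mmfs26 with flags such as OE) is a candidate for the incompatibility described in this article.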
