Cluster upgrade fails because one of the nodes failed to update the OS in RHOCP 4

Solution Verified

Environment

  • Red Hat OpenShift Container Platform
    • 4

Issue

  • The cluster upgrade fails because one of the nodes failed to update its OS.

Resolution

This issue has been reported to Red Hat engineering. It is being tracked in Bug. For more information, please open a new support case with Red Hat Support.

Workaround

  • To resolve this issue, connect to the affected node (ocp-lab-example-infra-node in this example) via SSH, become root, and run the following command.

    podman pull --tls-verify=false --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31dfb8492f2b5eefd675ee32b8e38ee4b5823a23261fdacb6ba2fd7263258b6e
    
  • If the pull succeeds, restart the machine-config-daemon pods.

    oc delete pod -n openshift-machine-config-operator -l k8s-app=machine-config-daemon
    
  • If the pull does not succeed, try the following instead (via SSH as root on the affected node; this requires two reboots).

    sh-5.1# rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31dfb8492f2b5eefd675ee32b8e38ee4b5823a23261fdacb6ba2fd7263258b6e
    sh-5.1# systemctl reboot
    sh-5.1# touch /run/machine-config-daemon-force
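
Both workaround commands take the full image pullspec, and a stray space or a truncated digest in a copied pullspec makes them fail for reasons unrelated to the original problem. The following is a minimal sketch of a pre-check, not part of the original solution. The function names is_digest_pullspec and rebase_target are hypothetical, and the check only assumes the `name@sha256:<64 hex digits>` form used in this article.

```shell
# Hypothetical helpers for illustration; not part of any Red Hat tooling.

# Succeed only if $1 is a single word of the form <name>@sha256:<64 hex digits>.
is_digest_pullspec() {
  printf '%s\n' "$1" | grep -Eq '^[^[:space:]@]+@sha256:[0-9a-f]{64}$'
}

# Print the target string passed to `rpm-ostree rebase --experimental`
# in this article, or fail if the pullspec does not look valid.
rebase_target() {
  is_digest_pullspec "$1" || { echo "invalid pullspec: $1" >&2; return 1; }
  printf 'ostree-unverified-registry:%s\n' "$1"
}

# Example use on the node (as root), assuming IMAGE holds the pullspec:
#   target="$(rebase_target "$IMAGE")" && rpm-ostree rebase --experimental "$target"
```

The helpers only validate and format the string; they do not contact the registry, so a well-formed pullspec can still fail to pull.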
    

Diagnostic Steps

  • Check the machine-config cluster operator for any similar error messages.

    - lastTransitionTime: "2025-05-14T14:07:36Z"
      message: One or more machine config pools are degraded, please see `oc get mcp`
        for further details and resolve before upgrading
      reason: DegradedPool
      status: "False"
      type: Upgradeable
    
  • Review the status of the machineconfigpool (MCP) and confirm its current state.

    $ oc get mcp
    NAME     CONFIG                UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    infra    rendered-infra-xxx    False     True       True       6              4                   4                     1                      3y
    master   rendered-master-xxx   True      False      False      3              3                   3                     0                      4y
    worker   rendered-worker-xxx   False     False      False      2              0                   0                     0                      4y
    
  • Inspect the status of the nodes to see if any are marked as SchedulingDisabled.

    ocp-lab-example-infra-node   Ready,SchedulingDisabled   infra,worker     2y    v1.27.16+03a907c
    
  • Review the node YAML to identify whether similar error messages are present.

    $ oc get node ocp-lab-example-infra-node -o yaml
    
    machineconfiguration.openshift.io/reason: |-
      failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31dfb8492f2b5eefd675ee32b8e38ee4b5823a23261fdacb6ba2fd7263258b6e : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31dfb8492f2b5eefd675ee32b8e38ee4b5823a23261fdacb6ba2fd7263258b6e: error: Creating importer: Failed to invoke skopeo proxy method OpenImage: remote error: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on [::1]:53: read udp [::1]:43165->[::1]:53: read: connection refused
      : exit status 1
    machineconfiguration.openshift.io/state: Degraded
    
  • Verify whether the machine-config-daemon pod running on the affected node also reports a similar error.

    $ oc logs machine-config-daemon-xxx -c machine-config-daemon -n openshift-machine-config-operator
    2025-05-15T19:49:07.812893595Z E0515 19:49:07.812878 3429607 writer.go:226] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31dfb8492f2b5eefd675ee32b8e38ee4b5823a23261fdacb6ba2fd7263258b6e : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31dfb8492f2b5eefd675ee32b8e38ee4b5823a23261fdacb6ba2fd7263258b6e: error: Creating importer: Failed to invoke skopeo proxy method OpenImage: remote error: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on [::1]:53: read udp [::1]:54775->[::1]:53: read: connection refused
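
The checks above can be scripted. The following is a minimal sketch, not part of the original solution. The function names degraded_pools, cordoned_nodes, and is_dns_lookup_failure are hypothetical, and the column positions assume the default `oc get mcp` and `oc get nodes` table layouts with every column populated.

```shell
# Hypothetical helpers for illustration; not part of any Red Hat tooling.

# Print the names of pools whose DEGRADED column (5th field) is "True".
# Reads `oc get mcp` table output on stdin.
degraded_pools() {
  awk 'NR > 1 && $5 == "True" { print $1 }'
}

# Print nodes whose STATUS column (2nd field) contains SchedulingDisabled.
# Reads `oc get nodes --no-headers` output on stdin.
cordoned_nodes() {
  awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Succeed if the text on stdin shows a refused DNS lookup, as in the
# "dial tcp: lookup quay.io on [::1]:53: ... connection refused" error above.
is_dns_lookup_failure() {
  grep -Eq 'dial tcp: lookup [^ ]+ on [^ ]+: .*connection refused'
}

# Example use against a cluster:
#   oc get mcp | degraded_pools
#   oc get nodes --no-headers | cordoned_nodes
#   oc get node ocp-lab-example-infra-node -o yaml | is_dns_lookup_failure \
#     && echo "node reports a refused DNS lookup"
```

Reading from stdin keeps the helpers independent of cluster access, so they can also be run against saved command output such as a must-gather excerpt.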
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
