Troubleshooting machine provisioning failure in OpenShift 4.x

Solution Verified

Environment

  • Red Hat OpenShift Container Platform (OCP 4)
  • Red Hat OpenShift Service on AWS (ROSA 4)
  • Red Hat OpenShift Dedicated 4 (OSD 4)

Issue

  • Worker nodes do not scale up after the node count is increased.
  • Machines are not provisioned because of an authorization failure.
  • Machines get stuck in a loop: each new machine enters the Provisioning phase, moves to Failed, and is then deleted.
  • The following event is logged in the openshift-machine-api namespace:
reconciler failed to Create machine: failed to launch instance: error launching instance: You are not authorized to perform this operation.

Resolution

According to the event log, the reconciler failed to create the machine (failed to launch instance) because of an authorization issue. This error indicates that the AWS Identity and Access Management (IAM) role or user performing the operation does not have the permissions required to launch EC2 instances. Because the error includes an encoded message, you can use the AWS Command Line Interface (AWS CLI) to decode it. The decoded message provides more detail about the authorization failure.

Run the following decode-authorization-message command:

$ aws sts decode-authorization-message --encoded-message encoded-message

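Replace encoded-message with the exact encoded message contained in the error message. The identity that runs the command must itself be allowed the sts:DecodeAuthorizationMessage action.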
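As a minimal sketch, the decode step can be wrapped in a small shell function. The function name decode_auth_failure is illustrative only and is not part of any Red Hat or AWS tooling. It assumes the AWS CLI is installed and configured with credentials that may call sts:DecodeAuthorizationMessage. The --query DecodedMessage and --output text options print only the decoded JSON document rather than the wrapping response.

```shell
# Hypothetical helper: decode an EC2 "Encoded authorization failure message".
# Assumes the AWS CLI is installed and the caller is allowed to use
# sts:DecodeAuthorizationMessage.
decode_auth_failure() {
  if [ -z "$1" ]; then
    echo "usage: decode_auth_failure <encoded-message>" >&2
    return 1
  fi
  # Quote the argument so the shell passes the message through unchanged.
  aws sts decode-authorization-message \
    --encoded-message "$1" \
    --query DecodedMessage \
    --output text
}
```

Calling decode_auth_failure with the encoded string as its single argument should print the decoded policy evaluation as JSON.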

Root Cause

The reconciler failed to create the machine (failed to launch instance) because of an authorization issue. The AWS Identity and Access Management (IAM) role or user performing the operation does not have the permissions required to launch EC2 instances.
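To confirm which permission is missing, the decoded authorization message can be reduced to a few fields. The sketch below is hedged: it assumes jq is installed and that the decoded document uses the commonly seen allowed, explicitDeny, and context fields. The function name summarize_denial is hypothetical. Fields that are absent are printed as null.

```shell
# Hypothetical helper: read a decoded authorization message (JSON) on stdin
# and print the fields that usually identify the denied request.
# Assumes jq is installed; the field names are an assumption based on the
# usual shape of decoded messages.
summarize_denial() {
  jq -r '"allowed: \(.allowed)",
         "explicitDeny: \(.explicitDeny)",
         "principal: \(.context.principal.arn)",
         "action: \(.context.action)",
         "resource: \(.context.resource)"'
}
```

The action line typically names the EC2 API call that was denied, and the principal line names the IAM role or user whose policy needs review.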

Diagnostic Steps

(1) List the machines in the openshift-machine-api namespace and check whether any are stuck in a loop: entering the Provisioning phase, moving to Failed, and then being deleted.

$ oc get machines -n openshift-machine-api
NAME                                      PHASE          TYPE         REGION      ZONE         AGE
rosa-demo-xxxxx-infra-eu-west-3a-xxxxx    Running        r5.xlarge    eu-west-3   eu-west-3a   334d
rosa-demo-xxxxx-infra-eu-west-3a-xxxxx    Running        r5.xlarge    eu-west-3   eu-west-3a   334d
rosa-demo-xxxxx-master-0                  Running        m5.2xlarge   eu-west-3   eu-west-3a   334d
rosa-demo-xxxxx-master-1                  Running        m5.2xlarge   eu-west-3   eu-west-3a   334d
rosa-demo-xxxxx-master-2                  Running        m5.2xlarge   eu-west-3   eu-west-3a   334d
rosa-demo-xxxxx-worker-eu-west-3a-xxxxx   Failed                                               3s       <-------
rosa-demo-xxxxx-worker-eu-west-3a-xxxxx   Running        m5.xlarge    eu-west-3   eu-west-3a   199d
rosa-demo-xxxxx-worker-eu-west-3a-xxxxx   Deleting                                             7s       <-------
rosa-demo-xxxxx-worker-eu-west-3a-xxxxx   Provisioning                                         2s       <-------
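
On a cluster with many machines, the rows of interest can be picked out with a small filter. This is a sketch only: the function name non_running_machines is hypothetical, and it assumes the default column layout shown above, where PHASE is the second column.

```shell
# Hypothetical helper: read `oc get machines` output on stdin and print the
# name and phase of every machine whose PHASE column is not "Running".
# Note: a machine with an empty phase is also printed, with whatever value
# lands in its second column (for example its age).
non_running_machines() {
  awk 'NR > 1 && $2 != "Running" { print $1, $2 }'
}

# Usage (requires a logged-in oc session):
#   oc get machines -n openshift-machine-api | non_running_machines
```

Running the filter several times a few seconds apart should show worker machines cycling through Provisioning, Failed, and Deleting if the loop is present.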

(2) Check the events in the openshift-machine-api namespace:

6m          Warning   FailedCreate            machine/rosa-demo-xxxxx-worker-eu-west-3a-xxxxx       rosa-demo-xxxxx-worker-eu-west-3a-xxxxx: reconciler failed to Create machine: failed to launch instance: error launching instance: You are not authorized to perform this operation. Encoded authorization failure message: k0KAbS-xbzFT8-_k...{ENCODED FAILURE MESSAGE }...DkE3akeyN3Sh8YQ0ghgE
6m          Normal    MachineDeleted          machine/rosa-demo-xxxxx-worker-eu-west-3a-xxxxx       Machine openshift-machine-api/srep-worker-healthcheck/rosa-demo-xxxxx-worker-eu-west-3a-xxxxx/ has been remediated by requesting to delete Machine object
5m59s       Normal    Delete                  machine/rosa-demo-xxxxx-worker-eu-west-3a-xxxxx       Deleted machine rosa-demo-xxxxx-worker-eu-west-3a-xxxxx
77s         Normal    SuccessfulUpdate        machineautoscaler/rosa-demo-xxxx-worker-eu-west-3a   Updated MachineAutoscaler target: openshift-machine-api/rosa-demo-xxxxx-worker-eu-west-3a
3m26s       Warning   RemediationRestricted   machinehealthcheck/srep-worker-healthcheck                  Remediation restricted due to exceeded number of unhealthy machines (total: 5, unhealthy: 4, maxUnhealthy: 3)
42m         Warning   RemediationRestricted   machinehealthcheck/srep-worker-healthcheck                  Remediation restricted due to exceeded number of unhealthy machines (total: 6, unhealthy: 5, maxUnhealthy: 3)
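
Events like those above can be listed with oc get events. As a hedged sketch, the helper below narrows the output to FailedCreate events and prints only their messages, so the encoded string is not mixed with other columns. The function name failed_create_messages is hypothetical, and a logged-in oc session with read access to the namespace is assumed.

```shell
# Hypothetical helper: print the message of every FailedCreate event in the
# openshift-machine-api namespace, one event per entry.
failed_create_messages() {
  oc get events -n openshift-machine-api \
    --field-selector reason=FailedCreate \
    -o jsonpath='{range .items[*]}{.message}{"\n"}{end}'
}
```

Dropping the --field-selector and -o options gives the full tabular event list, including the MachineDeleted and RemediationRestricted entries.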

(3) Copy the complete encoded authorization failure message from the event log, without truncating or altering it, so that it can be decoded.
For example, from the event log above:

k0KAbS-xbzFT8-_k...{ENCODED FAILURE MESSAGE }...DkE3akeyN3Sh8YQ0ghgE
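
Copying a long encoded string by hand is error-prone, so it may be easier to extract it with a filter. The sketch below assumes the message follows the literal text "Encoded authorization failure message: " and contains no whitespace. The function name extract_encoded_message is hypothetical.

```shell
# Hypothetical helper: read event text on stdin and print each encoded
# authorization failure message found, one per line.
# Everything after the marker text up to the next whitespace is treated as
# the encoded message.
extract_encoded_message() {
  sed -n 's/.*Encoded authorization failure message: \([^[:space:]]*\).*/\1/p'
}

# Usage (requires a logged-in oc session):
#   oc get events -n openshift-machine-api | extract_encoded_message
```

Storing the result in a shell variable and passing it in double quotes to the decode command keeps the string intact.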

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
