Troubleshooting machine provisioning failure in OpenShift 4.x
Environment
- Red Hat OpenShift Container Platform (OCP 4)
- Red Hat OpenShift Service on AWS (ROSA 4)
- Red Hat OpenShift Dedicated 4 (OSD 4)
Issue
- Worker nodes do not scale up after the node count is increased.
- Machines fail to provision because of an authorization failure.
- Machines are stuck in a loop: each one enters the Provisioning phase, moves to the Failed phase, and is then deleted.
- The following message appears in the events of the openshift-machine-api namespace:
reconciler failed to Create machine: failed to launch instance: error launching instance: You are not authorized to perform this operation.
Resolution
According to the event log, the reconciler failed to create the machine (failed to launch instance) because of an authorization failure. The AWS Identity and Access Management (IAM) role or user that performs the operation does not have the permissions required to launch EC2 instances. The error includes an encoded authorization failure message, which you can decode with the AWS Command Line Interface (AWS CLI). The decoded output gives more detail about why the request was denied.
Run the following decode-authorization-message command:
$ aws sts decode-authorization-message --encoded-message <encoded-message>
Replace <encoded-message> with the exact encoded string contained in the error message.
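The command returns the decoded details as a JSON string inside a DecodedMessage field, which is awkward to read as is. The following is a minimal sketch of a wrapper, not part of the original solution: the function name decode_auth_failure is hypothetical, and it assumes the AWS CLI is configured with credentials that are themselves allowed to call sts:DecodeAuthorizationMessage.

```shell
# Hypothetical helper (name is illustrative): decode an encoded EC2
# authorization failure message and print only the decoded JSON document.
# Assumes the caller's credentials permit sts:DecodeAuthorizationMessage.
decode_auth_failure() {
    if [ -z "$1" ]; then
        echo "usage: decode_auth_failure <encoded-message>" >&2
        return 1
    fi
    # --query DecodedMessage with --output text strips the outer wrapper
    # and prints the decoded message as plain JSON text.
    aws sts decode-authorization-message \
        --encoded-message "$1" \
        --query DecodedMessage --output text
}
```

The output can be piped to any JSON formatter for readability. The decoded document reports whether the denial came from an explicit deny or from a missing allow, together with the principal, the requested action, and the requested resource.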
Root Cause
The reconciler fails to create the machine (failed to launch instance) because of an authorization failure: the AWS IAM role or user that performs the operation does not have the permissions required to launch EC2 instances.
Diagnostic Steps
(1) List the machines in the openshift-machine-api namespace and check whether any of them are stuck in a loop of entering the Provisioning phase, moving to the Failed phase, and then being deleted:
$ oc get machines -n openshift-machine-api
NAME                                      PHASE          TYPE         REGION      ZONE         AGE
rosa-demo-xxxxx-infra-eu-west-3a-xxxxx    Running        r5.xlarge    eu-west-3   eu-west-3a   334d
rosa-demo-xxxxx-infra-eu-west-3a-xxxxx    Running        r5.xlarge    eu-west-3   eu-west-3a   334d
rosa-demo-xxxxx-master-0                  Running        m5.2xlarge   eu-west-3   eu-west-3a   334d
rosa-demo-xxxxx-master-1                  Running        m5.2xlarge   eu-west-3   eu-west-3a   334d
rosa-demo-xxxxx-master-2                  Running        m5.2xlarge   eu-west-3   eu-west-3a   334d
rosa-demo-xxxxx-worker-eu-west-3a-xxxxx   Failed                                               3s     <-------
rosa-demo-xxxxx-worker-eu-west-3a-xxxxx   Running        m5.xlarge    eu-west-3   eu-west-3a   199d
rosa-demo-xxxxx-worker-eu-west-3a-xxxxx   Deleting                                             7s     <-------
rosa-demo-xxxxx-worker-eu-west-3a-xxxxx   Provisioning                                         2s     <-------
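On a cluster with many machines, the affected entries are easier to spot when the Running ones are filtered out. A small sketch, assuming the default table output of oc get machines with a header row; the function name non_running_machines is hypothetical:

```shell
# Hypothetical filter (name is illustrative): print the name and phase of
# every machine that is not in the Running phase.
# Reads the default `oc get machines` table, including its header, on stdin.
non_running_machines() {
    # NR > 1 skips the header row; column 2 is PHASE in the default output.
    awk 'NR > 1 && $2 != "Running" { print $1, $2 }'
}

# Usage (requires cluster access):
#   oc get machines -n openshift-machine-api | non_running_machines
```

Because the loop cycles within seconds, a single listing may miss it. Adding the -w flag to oc get machines keeps the listing open and prints each phase change as it happens.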
(2) Check the events in the openshift-machine-api namespace:
6m Warning FailedCreate machine/rosa-demo-xxxxx-worker-eu-west-3a-xxxxx rosa-demo-xxxxx-worker-eu-west-3a-xxxxx: reconciler failed to Create machine: failed to launch instance: error launching instance: You are not authorized to perform this operation. Encoded authorization failure message: k0KAbS-xbzFT8-_k...{ENCODED FAILURE MESSAGE }...DkE3akeyN3Sh8YQ0ghgE
6m Normal MachineDeleted machine/rosa-demo-xxxxx-worker-eu-west-3a-xxxxx Machine openshift-machine-api/srep-worker-healthcheck/rosa-demo-xxxxx-worker-eu-west-3a-xxxxx/ has been remediated by requesting to delete Machine object
5m59s Normal Delete machine/rosa-demo-xxxxx-worker-eu-west-3a-xxxxx Deleted machine rosa-demo-xxxxx-worker-eu-west-3a-xxxxx
77s Normal SuccessfulUpdate machineautoscaler/rosa-demo-xxxx-worker-eu-west-3a Updated MachineAutoscaler target: openshift-machine-api/rosa-demo-xxxxx-worker-eu-west-3a
3m26s Warning RemediationRestricted machinehealthcheck/srep-worker-healthcheck Remediation restricted due to exceeded number of unhealthy machines (total: 5, unhealthy: 4, maxUnhealthy: 3)
42m Warning RemediationRestricted machinehealthcheck/srep-worker-healthcheck Remediation restricted due to exceeded number of unhealthy machines (total: 6, unhealthy: 5, maxUnhealthy: 3)
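Events like the ones above can be listed with oc get events. The sketch below narrows the listing to Warning events and sorts them by time, so that the FailedCreate entries stand out; the function name machine_api_warnings is hypothetical and the choice of filters is an assumption, not something the original solution prescribes.

```shell
# Hypothetical helper (name is illustrative): list only Warning events in
# the openshift-machine-api namespace, oldest first.
machine_api_warnings() {
    oc get events -n openshift-machine-api \
        --field-selector type=Warning \
        --sort-by=.lastTimestamp
}

# Usage (requires cluster access):
#   machine_api_warnings
```

Dropping the --field-selector option shows the Normal events as well, such as the MachineDeleted and Delete entries that complete the loop.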
(3) Copy the encoded authorization failure message from the event log exactly as it appears, without truncating or altering it, so that it can be decoded later.
For example, the following string is taken from the event log above:
k0KAbS-xbzFT8-_k...{ENCODED FAILURE MESSAGE }...DkE3akeyN3Sh8YQ0ghgE
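Copying a long encoded string by hand is error-prone, so it can help to extract it from the event text with a filter. A minimal sketch, assuming the event message contains the literal phrase "Encoded authorization failure message: " followed by the encoded string with no embedded whitespace; the function name extract_encoded_message is hypothetical:

```shell
# Hypothetical filter (name is illustrative): read event text on stdin and
# print the token that follows "Encoded authorization failure message: ".
# Lines without that phrase produce no output.
extract_encoded_message() {
    sed -n 's/.*Encoded authorization failure message: \([^[:space:]]*\).*/\1/p'
}

# Usage (requires cluster access); prints one encoded string per matching event:
#   oc get events -n openshift-machine-api \
#       -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' |
#       extract_encoded_message
```

The extracted string can then be passed unchanged to the aws sts decode-authorization-message command.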