New machine is stuck in the Provisioned state after scaling up in ARO/Azure


Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Azure Red Hat OpenShift (ARO)
    • 4
  • Azure

Issue

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

  • When trying to add a new machine by scaling up in ARO or in an OpenShift cluster installed on Azure, the new machine remains stuck in the Provisioned state and no new node is added to the cluster.
  • Errors like the following are shown in the logs of the machine-api-controllers pod:

    controller/machine_controller "msg"="Reconciler error" "error"="vm for machine machine_name-compute-region-xxxx exists, but has unexpected 'Failed' provisioning state" "name"="machine_name-compute-region-xxxx" "namespace"="openshift-machine-api"
    
    machine_name-compute-region-xxxx not found: %!w(string=compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Compute/virtualMachines/machine_name-compute-region-xxxx' under resource group 'rg-name' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix")
    
  • The following message is shown in the machine:

    Failed to check if machine exists: vm for machine machine_name-compute-region-xxxx exists, but has unexpected 'Failed' provisioning state
    
  • The following FailedCreate event is shown in the machine or in the openshift-machine-api namespace:

    CreateError: failed to reconcile machine "machine_name-compute-region-xxxx"s: failed to create vm machine_name-compute-region-xxxx: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed
    

Resolution

The "failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed" error message, shown in the Diagnostic Steps section, means that the provisioning of the instance in Azure has not finished yet and the machine-api is still waiting for it to complete.

The machine-controller will regularly check the state of the Virtual Machine until Azure shows it as provisioned.
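
To follow that reconciliation from the cluster side, the phase of the Machine can be polled until it changes. The following is only a minimal sketch and not part of the original solution: the function name wait_for_machine_phase, the default of 20 attempts and the 30-second interval are illustrative assumptions. The .status.phase field is the value shown in the PHASE column of "oc get machines".

```shell
# Hypothetical helper: poll a Machine until it reaches the wanted phase.
# Usage: wait_for_machine_phase <machine-name> [wanted-phase] [attempts] [sleep-seconds]
wait_for_machine_phase() {
  machine="$1"
  wanted="${2:-Running}"
  attempts="${3:-20}"
  interval="${4:-30}"
  i=1
  while :; do
    # .status.phase is the value shown in the PHASE column of "oc get machines".
    phase="$(oc get machine "$machine" -n openshift-machine-api -o jsonpath='{.status.phase}')"
    echo "attempt $i/$attempts: $machine is in phase '${phase:-unknown}'"
    if [ "$phase" = "$wanted" ]; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$interval"
  done
}

# Example (not executed here):
#   wait_for_machine_phase machine_name-compute-region-xxxx Running
```

A non-zero return code means the machine did not reach the wanted phase within the given attempts, which is the point at which the sections below apply.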

Machine stuck in Provisioned or Provisioning state

If the issue persists and the machine remains in the Provisioned or Provisioning state with Status=404 Code="ResourceNotFound" messages, check the Azure web console for details on why the provisioning of the Virtual Machine is not succeeding.

An example of an Azure error for this issue could be:

Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance.

ZonalAllocationFailed
Provisioning state error code ProvisioningState/failed/ZonalAllocationFailed

Note: The above is only an example; the actual error could be a different one and must be checked in the Azure web console.
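
The same details are usually also reachable from the Azure CLI. This solution only refers to the web console, so the following is a rough sketch under assumptions: the az CLI is installed and logged in to the right subscription, and the function names and arguments are illustrative. The instance view of the VM lists its status codes and messages, and the last segment of a status code such as the one above is the Azure error code.

```shell
# Hypothetical helper: print the status codes and messages of a VM's instance view.
# Usage: azure_vm_statuses <resource-group> <vm-name>
azure_vm_statuses() {
  az vm get-instance-view \
    --resource-group "$1" \
    --name "$2" \
    --query "instanceView.statuses[].[code, message]" \
    --output tsv
}

# Hypothetical helper: print the last segment of a status code such as
# "ProvisioningState/failed/ZonalAllocationFailed".
provisioning_error_code() {
  echo "${1##*/}"
}

# Example (not executed here):
#   azure_vm_statuses rg-name machine_name-compute-region-xxxx
```

If the Virtual Machine no longer exists in Azure, this command is expected to return the same ResourceNotFound error, and the web console remains the place to look for the reason.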

Machine provisioned in Azure but OCP node not added to the cluster

If the machine is properly provisioned in Azure, but the node is not added to the cluster, check for a message with "Provisioning state is 'Succeeded' for machine" with the name of the machine in Provisioned status (refer to the Diagnostic Steps section for additional information).

If the machine finished provisioning but the OCP node is not added to the cluster, check for a misconfigured ImageContentSourcePolicy, ContainerRuntimeConfig, KubeletConfig or MachineConfig (in ARO, configuring any of those resources is not allowed). Connect to the machine over SSH and check for errors in the system journal (journalctl), the kubelet logs, and so on.
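
As a hedged sketch of those checks (not part of the original solution), the helpers below list the resource types named above, read the internal IP from the Machine status, and print the latest kubelet journal entries. The function names are hypothetical, and the SSH step assumes that the Machine status already contains an InternalIP address, that an SSH key for the core user was configured, and that the VM is reachable from the host running the command (for example a bastion); this may not be possible in ARO.

```shell
# List the custom resources that can prevent a node from joining the cluster.
list_node_config_resources() {
  oc get imagecontentsourcepolicy,containerruntimeconfig,kubeletconfig,machineconfig
}

# Hypothetical helper: print the InternalIP recorded in the Machine status.
machine_internal_ip() {
  oc get machine "$1" -n openshift-machine-api \
    -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'
}

# Hypothetical helper: show the most recent kubelet journal entries of the host.
# Assumes SSH access as the "core" user from a host that can reach the VM.
kubelet_journal() {
  ssh "core@$(machine_internal_ip "$1")" \
    'sudo journalctl -b -u kubelet --no-pager | tail -n 100'
}

# Example (not executed here):
#   kubelet_journal machine_name-compute-region-xxxx
```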

Root Cause

The Virtual Machine that was looked up by the machine-controller was not found on the Azure API.

Diagnostic Steps

Check for the error messages "has unexpected 'Failed' provisioning state" and "ResourceNotFound" related to the failing Virtual Machine:

$ oc get pods -n openshift-machine-api | grep machine-api-controllers
[machine-api-controllers-pod_name]      7/7     Running   0          30d

$ oc logs -n openshift-machine-api -c machine-controller [machine-api-controllers-pod_name] | grep "has unexpected 'Failed' provisioning state\|ResourceNotFound.*Microsoft.Compute/virtualMachines"

2022-01-01T00:00:00.481590671Z W0101 00:00:00.481539       1 virtualmachines.go:93] vm machine_name-compute-region-xxxx not found: %!w(string=compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Compute/virtualMachines/machine_name-compute-region-xxxx' under resource group 'rg-name' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix")
[...]
2022-01-01T00:00:00.259872468Z E0101 00:00:45.259825       1 actuator.go:225] failed to check machine machine_name-compute-region-xxxx exists: vm for machine machine_name-compute-region-xxxx exists, but has unexpected 'Failed' provisioning state
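
The two steps above can also be combined without looking up the pod name first. This is a small sketch that assumes the pod is managed by a Deployment named machine-api-controllers in the openshift-machine-api namespace; the function name is hypothetical.

```shell
# Hypothetical helper: filter the machine-controller logs for the two error messages.
# Reads the logs through the Deployment, so no pod name is needed.
failed_vm_log_lines() {
  oc logs -n openshift-machine-api deployment/machine-api-controllers -c machine-controller |
    grep -E "has unexpected 'Failed' provisioning state|ResourceNotFound.*Microsoft\.Compute/virtualMachines"
}

# Example (not executed here), limited to one machine:
#   failed_vm_log_lines | grep -F machine_name-compute-region-xxxx
```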

Also check the messages about the Virtual Machine creation:

$ oc logs -n openshift-machine-api -c machine-controller [machine-api-controllers-pod_name] 
[...]
2022-01-01T00:00:00.695862183Z I0101 00:00:23.695838       1 virtualmachines.go:120] creating vm machine_name-compute-region-xxxx
2022-01-01T00:00:00.036280785Z I0101 00:00:25.036243       1 machine_scope.go:192] machine_name-compute-region-xxxx: patching machine
2022-01-01T00:00:00.064348543Z E0101 00:00:25.064296       1 actuator.go:79] Machine error: failed to reconcile machine "machine_name-compute-region-xxxx"s: failed to create vm machine_name-compute-region-xxxx: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed

2022-01-01T00:00:00.064421342Z I0101 00:00:25.064397       1 logr.go:252] events "msg"="Warning"  "message"="CreateError: failed to reconcile machine \"machine_name-compute-region-xxxx\"s: failed to create vm machine_name-compute-region-xxxx: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"machine_name-compute-region-xxxx","uid":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"660674"} "reason"="FailedCreate"

Verify that there are no "Provisioning state is 'Succeeded' for machine" messages for that Virtual Machine (replace [machine_name-compute-region-xxxx] in the command with the name of the machine):

$ oc logs -n openshift-machine-api -c machine-controller [machine-api-controllers-pod_name] | grep "Provisioning state is 'Succeeded' for machine" | grep [machine_name-compute-region-xxxx]
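
For scripted checks, the same test can be expressed as an exit code. The sketch below is an illustration only: the function name is hypothetical, the logs are read through the machine-api-controllers Deployment (assumed to exist under that name), and grep -F is used so that the machine name is matched as a literal string.

```shell
# Hypothetical helper: succeed only if the machine-controller logged a
# "Provisioning state is 'Succeeded'" message for the given machine.
vm_provisioning_succeeded() {
  oc logs -n openshift-machine-api deployment/machine-api-controllers -c machine-controller |
    grep -F "Provisioning state is 'Succeeded' for machine" |
    grep -qF -- "$1"
}

# Example (not executed here):
#   if vm_provisioning_succeeded machine_name-compute-region-xxxx; then
#     echo "VM provisioned in Azure; check why the node did not join"
#   else
#     echo "VM not provisioned yet; check the provisioning error in Azure"
#   fi
```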

Check the events for the machine stuck in Provisioned:

$ oc get machines -n openshift-machine-api
NAME                               PHASE         TYPE               REGION   ZONE   AGE
[...]
machine_name-compute-region-xxxx   Provisioned   Standard_E32s_v5   region   2      13m
[...]

$ oc describe machine machine_name-compute-region-xxxx -n openshift-machine-api
[...]
Events:
  Type     Reason        Age                From              Message
  ----     ------        ----               ----              -------
  Warning  FailedCreate  95d                azure-controller  CreateError: failed to reconcile machine "machine_name-compute-region-xxxx"s: failed to create vm machine_name-compute-region-xxxx: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed
[...]
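
The FailedCreate events can also be listed directly from the namespace instead of reading them from the describe output. This is a minimal sketch with a hypothetical function name; it relies on the standard event field selectors involvedObject.kind, involvedObject.name and reason.

```shell
# Hypothetical helper: list the FailedCreate events recorded for one Machine.
machine_failed_create_events() {
  oc get events -n openshift-machine-api \
    --field-selector "involvedObject.kind=Machine,involvedObject.name=$1,reason=FailedCreate" \
    --sort-by=.lastTimestamp
}

# Example (not executed here):
#   machine_failed_create_events machine_name-compute-region-xxxx
```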

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
