Scaling up a new machine is stuck in a Provisioned state in ARO/Azure
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Azure Red Hat OpenShift (ARO) 4
- Azure
Issue
Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.
- When trying to scale up a new machine in ARO or in an OpenShift cluster installed on Azure, the new machine is stuck in the Provisioned state, but no new node is added to the cluster.
- Errors like the following are shown in the machine-api-controllers logs:
  controller/machine_controller "msg"="Reconciler error" "error"="vm for machine machine_name-compute-region-xxxx exists, but has unexpected 'Failed' provisioning state" "name"="machine_name-compute-region-xxxx" "namespace"="openshift-machine-api
  machine_name-compute-region-xxxx not found: %!w(string=compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Compute/virtualMachines/machine_name-compute-region-xxxx' under resource group 'rg-name' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix")
- The following message is shown in the machine:
  Failed to check if machine exists: vm for machine machine_name-compute-region-xxxx exists, but has unexpected 'Failed' provisioning state
- The following FailedCreate event is shown in the machine or in the openshift-machine-api namespace:
  CreateError: failed to reconcile machine "machine_name-compute-region-xxxx"s: failed to create vm machine_name-compute-region-xxxx: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed
Resolution
The error message "failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed", as shown in the Diagnostic Steps section, means that the provisioning of the instance on Azure has not finished yet and the machine-api is still checking for its completion.
The machine-controller regularly checks the state of the Virtual Machine until Azure reports it as provisioned.
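While the machine-controller keeps reconciling, the phase of the Machine can be watched from the cluster side. The following is a minimal sketch, not an official procedure: `wait_for_machine_running` is a hypothetical helper, and the timeout and interval defaults are assumptions to be adjusted to the environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll the phase of a Machine until it is Running.
# Arguments: machine name, timeout in seconds (assumed default 1800),
# polling interval in seconds (assumed default 30).
# Returns 0 when the phase is Running, 2 when it is Failed, 1 on timeout.
wait_for_machine_running() {
  local machine="$1"
  local timeout="${2:-1800}"
  local interval="${3:-30}"
  local elapsed=0
  local phase=""

  while [ "$elapsed" -le "$timeout" ]; do
    # .status.phase holds values such as Provisioning, Provisioned, Running or Failed
    phase="$(oc get machine "$machine" -n openshift-machine-api -o jsonpath='{.status.phase}')"
    echo "machine=${machine} phase=${phase:-unknown} elapsed=${elapsed}s"
    case "$phase" in
      Running) return 0 ;;
      Failed)  return 2 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}

# Example call (machine name taken from the examples in this article):
# wait_for_machine_running machine_name-compute-region-xxxx 1800 30
```

If the helper times out while the phase is still Provisioned or Provisioning, continue with the checks described in the following sections.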
Machine stuck in Provisioned or Provisioning state
If the issue persists and the machine remains in the Provisioned or Provisioning state with the Status=404 Code="ResourceNotFound" messages, check the Azure web console for details on why the provisioning of the Virtual Machine is not succeeding.
An example of an Azure error for this issue could be:
Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance.
ZonalAllocationFailed
Provisioning state error code ProvisioningState/failed/ZonalAllocationFailed
Note: The above is only an example; the actual cause could be different and must be checked in the Azure web console.
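As an alternative to the web console, the same provisioning state code can possibly be read with the Azure CLI. The sketch below assumes that the Azure CLI is installed and that the account has read access to the resource group of the cluster; `provisioning_failure_code` is a hypothetical helper that only extracts the error code from status codes shaped like the one shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Virtual Machine status codes on stdin (one per line)
# and print the error code of any failed provisioning state, for example
# "ProvisioningState/failed/ZonalAllocationFailed" -> "ZonalAllocationFailed".
provisioning_failure_code() {
  sed -n 's|^ProvisioningState/failed/\(.*\)$|\1|p'
}

# Assumed usage (rg-name and the machine name are the placeholders used in this article):
# az vm get-instance-view --resource-group rg-name --name machine_name-compute-region-xxxx \
#   --query "instanceView.statuses[].code" --output tsv | provisioning_failure_code
```

An empty output means that no failed provisioning state was reported in the status codes that were read.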
Machine provisioned in Azure but OCP node not added to the cluster
If the machine is properly provisioned in Azure, but the node is not added to the cluster, check for a message containing "Provisioning state is 'Succeeded' for machine" with the name of the machine in Provisioned status (refer to the Diagnostic Steps section for additional information).
If the machine finished provisioning but the OCP node is not added to the cluster, check for a misconfigured ImageContentSourcePolicy, ContainerRuntimeConfig, KubeletConfig or MachineConfig (in ARO, configuring any of those resources is not allowed). Access the Machine with ssh and check for errors in the journalctl output, the kubelet logs, etc.
Root Cause
The Virtual Machine that was looked up by the machine-controller was not found on the Azure API.
Diagnostic Steps
Check for the error messages "has unexpected 'Failed' provisioning state" and "ResourceNotFound" related to the failing Virtual Machine:
$ oc get pods -n openshift-machine-api | grep machine-api-controllers
[machine-api-controllers-pod_name] 7/7 Running 0 30d
$ oc logs -n openshift-machine-api -c machine-controller [machine-api-controllers-pod_name] | grep "has unexpected 'Failed' provisioning state\|ResourceNotFound.*Microsoft.Compute/virtualMachines"
2022-01-01T00:00:00.481590671Z W0101 00:00:00.481539 1 virtualmachines.go:93] vm machine_name-compute-region-xxxx not found: %!w(string=compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Compute/virtualMachines/machine_name-compute-region-xxxx' under resource group 'rg-name' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix")
[...]
2022-01-01T00:00:00.259872468Z E0101 00:00:45.259825 1 actuator.go:225] failed to check machine machine_name-compute-region-xxxx exists: vm for machine machine_name-compute-region-xxxx exists, but has unexpected 'Failed' provisioning state
Also check for the messages about the Virtual Machine creation:
$ oc logs -n openshift-machine-api -c machine-controller [machine-api-controllers-pod_name]
[...]
2022-01-01T00:00:00.695862183Z I0101 00:00:23.695838 1 virtualmachines.go:120] creating vm machine_name-compute-region-xxxx
2022-01-01T00:00:00.036280785Z I0101 00:00:25.036243 1 machine_scope.go:192] machine_name-compute-region-xxxx: patching machine
2022-01-01T00:00:00.064348543Z E0101 00:00:25.064296 1 actuator.go:79] Machine error: failed to reconcile machine "machine_name-compute-region-xxxx"s: failed to create vm machine_name-compute-region-xxxx: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed
2022-01-01T00:00:00.064421342Z I0101 00:00:25.064397 1 logr.go:252] events "msg"="Warning" "message"="CreateError: failed to reconcile machine \"machine_name-compute-region-xxxx\"s: failed to create vm machine_name-compute-region-xxxx: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"machine_name-compute-region-xxxx","uid":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"660674"} "reason"="FailedCreate"
Verify that there are no "Provisioning state is 'Succeeded' for machine" messages for that Virtual Machine (replace [machine_name-compute-region-xxxx] in the command with the name of the machine):
$ oc logs -n openshift-machine-api -c machine-controller [machine-api-controllers-pod_name] | grep "Provisioning state is 'Succeeded' for machine" | grep [machine_name-compute-region-xxxx]
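The log checks above can be combined into a single pass over the machine-controller log. This is only a sketch: `classify_machine_log` is a hypothetical helper, and the precedence of the verdicts (a "Succeeded" message wins over older errors) is an assumption that may not fit every case.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read machine-controller log lines on stdin and print one
# verdict for the machine named in $1, based on the messages quoted in this article.
classify_machine_log() {
  local machine="$1"
  local lines
  # Keep only the lines that mention the machine; "|| true" tolerates no match.
  lines="$(grep -F -- "$machine" || true)"
  case "$lines" in
    *"Provisioning state is 'Succeeded' for machine"*) echo "succeeded" ;;
    *"has unexpected 'Failed' provisioning state"*)    echo "failed" ;;
    *'Code="ResourceNotFound"'*)                       echo "not-found" ;;
    *"asynchronous operation has not completed"*)      echo "in-progress" ;;
    *)                                                 echo "unknown" ;;
  esac
}

# Assumed usage, with the same placeholders as the commands above:
# oc logs -n openshift-machine-api -c machine-controller [machine-api-controllers-pod_name] \
#   | classify_machine_log [machine_name-compute-region-xxxx]
```

A "failed" or "not-found" verdict points to the Azure side of the provisioning, while "succeeded" without a new node points to the node configuration checks described in the Resolution section.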
Check the events for the machine stuck in Provisioned:
$ oc get machines -n openshift-machine-api
NAME                               PHASE         TYPE               REGION   ZONE   AGE
[...]
machine_name-compute-region-xxxx   Provisioned   Standard_E32s_v5   region   2      13m
[...]
$ oc describe machine machine_name-compute-region-xxxx -n openshift-machine-api
[...]
Events:
Type     Reason        Age   From              Message
----     ------        ----  ----              -------
Warning  FailedCreate  95d   azure-controller  CreateError: failed to reconcile machine "machine_name-compute-region-xxxx"s: failed to create vm machine_name-compute-region-xxxx: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed
[...]
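When several machines are involved, the two steps above (listing the machines and then checking their events) could be chained. The following sketch assumes the column layout of the `oc get machines` output shown above, with the phase in the second column; `stuck_machines` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "oc get machines --no-headers" output on stdin and
# print the names of the machines whose PHASE column is Provisioned or Provisioning.
stuck_machines() {
  awk '$2 == "Provisioned" || $2 == "Provisioning" { print $1 }'
}

# Assumed usage: list the FailedCreate events of every stuck machine.
# oc get machines -n openshift-machine-api --no-headers | stuck_machines | while read -r name; do
#   oc get events -n openshift-machine-api \
#     --field-selector "involvedObject.name=${name},reason=FailedCreate"
# done
```

Note that a machine without any phase yet would shift the columns, so the output should be compared with the plain `oc get machines` listing when in doubt.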