PVC creation on OpenShift on Azure fails with HTTPStatusCode: 403 does not have authorization to perform action

Solution Verified - Updated -

Environment

  • Red Hat OpenShift Container Platform 4.x
  • Microsoft Azure

Issue

  • PVCs are stuck in Pending status with errors in the Events log such as the following:
"GRPC error: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 403, RawError: {"error":{"code":"AuthorizationFailed","message":"The client '680d27f3-7cc7-4ffb-b9a0-895212345678' with object id '680d27f3-7cc7-4ffb-b9a0-895212345678' does not have authorization to perform action 'Microsoft.Compute/disks/write' over scope '/subscriptions/d5e93d54-574c-47a3-a9cf-9e6e87654321/resourceGroups/rg-my-ocp-cluster/providers/Microsoft.Compute/disks/pvc-6cbb4e2a-5411-4740-a346-456e6101928374' or the scope is invalid. If access was recently granted, please refresh your credentials."}}"

Resolution

  • Remove any Azure Security Policies that have been applied to the OCP nodes in the Azure Resource Group
  • For each OpenShift node, disable the System Assigned Managed Identity, ensuring that only the User Assigned Managed Identity remains
  • Replace all worker nodes by scaling up new worker nodes and scaling down old worker nodes
  • Drain, shut down, and restart each master node, one at a time
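
To see which nodes still carry a System Assigned Managed Identity before and after the second step above, the VM list for the Resource Group can be filtered by identity type. The following is a minimal sketch, not a supported tool. It assumes the input is JSON in the shape produced by `az vm list --resource-group <resource-group> --output json`, with each VM exposing `identity.type` and `identity.principalId`. The function name and the sample VM names and principal ID are hypothetical.

```python
import json


def nodes_with_system_identity(vm_list_json: str) -> list:
    """Return name and principal ID for every VM that has a system-assigned identity."""
    flagged = []
    for vm in json.loads(vm_list_json):
        identity = vm.get("identity") or {}
        # Assumption: the identity type is one of "SystemAssigned", "UserAssigned",
        # "SystemAssigned, UserAssigned", or "None".
        types = {t.strip() for t in (identity.get("type") or "").split(",")}
        if "SystemAssigned" in types:
            flagged.append(
                {"name": vm["name"], "principalId": identity.get("principalId")}
            )
    return flagged


# Hypothetical sample data: one node with both identity types, one with only the UAMI.
SAMPLE_VM_LIST = json.dumps([
    {
        "name": "example-worker-1",
        "identity": {
            "type": "SystemAssigned, UserAssigned",
            "principalId": "00000000-0000-0000-0000-000000000001",
        },
    },
    {
        "name": "example-worker-2",
        "identity": {"type": "UserAssigned", "principalId": None},
    },
])

if __name__ == "__main__":
    for node in nodes_with_system_identity(SAMPLE_VM_LIST):
        print(node["name"], node["principalId"])
```

Any node the sketch reports still has a SAMI attached. Once the identity has been disabled on every node, the list should be empty.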

Root Cause

  • Azure Security Policies have been applied to objects in the Resource Group that contains the OpenShift cluster
  • Applying the Security Policy creates a System Assigned Managed Identity (SAMI) for each OpenShift node
  • OpenShift will start using the SAMI instead of the User Assigned Managed Identity (UAMI) that was used to create the cluster
  • The SAMI has not been assigned the Contributor role, which is required to perform actions such as creating the Azure disks that back PVs

Diagnostic Steps

  • Look for events in the Pending PVC or in the cluster logs that indicate an authorization error
  • In the error, note the UUID reported as the client and as the object ID; the two values should be identical. Search for that UUID in the Azure console and confirm that it is assigned to a specific cluster node and is of the type "Enterprise Application"
  • Notably, the UUID in the error is not the Service Principal that was used during the OpenShift installation
  • In the Azure console, determine whether any Security Policies, such as ASC provisioning machines with user assigned MI for GC agent, ASC provisioning machines with no MI for GC agent, ASC provisioning Guest Configuration agent for Linux, or AzurePolicyForLinux, have been applied to the Resource Group
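
The identifiers needed for these checks can be pulled out of the event text instead of being copied by hand. The following is a minimal sketch that assumes the event message has the same shape as the example in the Issue section: a `RawError:` marker followed by a JSON payload whose `message` field names the client, object ID, action, and scope. The function name is hypothetical, and the sample is the event quoted above.

```python
import json
import re


def parse_authorization_error(event_message: str) -> dict:
    """Extract the error code, client ID, object ID, action, and scope from a 403 event."""
    # The JSON payload follows the "RawError:" marker in the event text.
    _, _, raw = event_message.partition("RawError:")
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON payload found after 'RawError:'")
    error = json.loads(raw[start:end + 1])["error"]
    match = re.search(
        r"client '(?P<client_id>[^']+)' with object id '(?P<object_id>[^']+)' "
        r"does not have authorization to perform action '(?P<action>[^']+)' "
        r"over scope '(?P<scope>[^']+)'",
        error["message"],
    )
    if match is None:
        raise ValueError("message does not match the expected AuthorizationFailed format")
    return {"code": error["code"], **match.groupdict()}


SAMPLE_EVENT = (
    'GRPC error: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 403, '
    'RawError: {"error":{"code":"AuthorizationFailed","message":"The client '
    "'680d27f3-7cc7-4ffb-b9a0-895212345678' with object id "
    "'680d27f3-7cc7-4ffb-b9a0-895212345678' does not have authorization to "
    "perform action 'Microsoft.Compute/disks/write' over scope "
    "'/subscriptions/d5e93d54-574c-47a3-a9cf-9e6e87654321/resourceGroups/"
    "rg-my-ocp-cluster/providers/Microsoft.Compute/disks/"
    "pvc-6cbb4e2a-5411-4740-a346-456e6101928374' or the scope is invalid. "
    'If access was recently granted, please refresh your credentials."}}'
)

if __name__ == "__main__":
    info = parse_authorization_error(SAMPLE_EVENT)
    for key, value in info.items():
        print(f"{key}: {value}")
```

The `client_id` value is the UUID to search for in the Azure console. If it belongs to a node rather than to the identity used at installation, the cluster is most likely using a SAMI.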

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
