Troubleshooting Azure Red Hat OpenShift 4.14.z / 4.13.40: Cluster Installs and provisioning new machines fail

Solution Verified - Updated -

Environment

  • Azure Red Hat OpenShift (ARO)
    • 4.14.z
    • 4.13.40

Issue

Update as of October 25, 2024

This issue is now fixed for installations and upgrades. Upgrades along Z-streams without admin acks are considered safe for clusters with special network routing.

  • Cluster installations at 4.14.z or 4.13.40 fail with InternalServerError:

    • Private cluster with User Defined Routing (UDR) specified and the subnet route tables lack an Internet route for arosvc.azurecr.io
    • When a Virtual Appliance / Firewall is used that blocks outbound Internet traffic for arosvc.azurecr.io
  • Clusters upgraded to 4.14.z / 4.13.40 fail to provision new machines:

    • When UDR is specified and the subnet route tables lack an Internet route for arosvc.azurecr.io
    • When a Virtual Appliance / Firewall is used that blocks outbound Internet traffic for arosvc.azurecr.io

Resolution

Cluster Install:

September 5, 2024 Update:

  • As of September 5, 2024, this issue is patched for all new cluster installs. Customers can create a new cluster without any additional steps.

Upgraded cluster:

October 25, 2024 Update:
The fleet-wide maintenance is complete and no new edge blocks will be created. Customers can upgrade clusters currently on any 4.12.z stream to 4.13 and beyond along the unblocked edges. The safe edges are: 4.13.51, 4.13.52, 4.14.38, and 4.14.39.

Previously:

  • Add an Internet route for arosvc.azurecr.io or add an Internet route for 0.0.0.0/0 by following the Azure documentation:
    • Azure/Manage route table
    • Azure/Virtual Network - UDR overview
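
For clusters that still need the earlier workaround, the route can also be added from the Azure CLI. The following is a minimal sketch that only builds the argument list for az network route-table route create. The resource group, route table name, and the route name "internet-default" are hypothetical placeholders, not values from this article.

```python
import shlex
from typing import List


def internet_route_command(
    resource_group: str,
    route_table: str,
    route_name: str = "internet-default",  # hypothetical route name
    address_prefix: str = "0.0.0.0/0",
) -> List[str]:
    """Build the Azure CLI arguments that add a route with next hop type Internet.

    The function only returns the command; it does not run it.
    """
    return [
        "az", "network", "route-table", "route", "create",
        "--resource-group", resource_group,
        "--route-table-name", route_table,
        "--name", route_name,
        "--address-prefix", address_prefix,
        "--next-hop-type", "Internet",
    ]


def as_shell_line(command: List[str]) -> str:
    """Render the argument list as a single, safely quoted shell line for review."""
    return shlex.join(command)
```

Print as_shell_line(internet_route_command("my-rg", "my-route-table")) to review the command before running it, for example with subprocess.run(command, check=True). A 0.0.0.0/0 route to the Internet sends all outbound traffic around any virtual appliance, so a narrower address prefix may be preferable where one is known.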

Root Cause

When a new 4.14.z / 4.13.40 machine (master or worker) is provisioned, one of the first steps in the boot sequence is for the machine-config-daemon-pull service to pull images from the ARO Azure Container Registry (ACR) at arosvc.azurecr.io:443. When machine provisioning depends on the RP Gateway Service, the ACR image pull fails with a timeout and the new machine does not provision. This occurs when the network policy does not allow Internet-bound traffic to the ARO Azure Container Registry.

This is due to the ARO gateway proxy not being active early enough in the machine boot sequence. The machine-config-daemon-pull service attempts to pull an image before the ARO-provisioned dnsmasq service starts. The dnsmasq service enables the machine to resolve the ACR to the private link IP of the RP Gateway service and in turn pull the image and correctly provision the machine.
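
The failure described above comes down to whether a TCP connection to arosvc.azurecr.io:443 can be opened from the machine subnet. A minimal sketch of such a probe follows. The function name can_reach and the 5 second timeout are illustrative choices, and the probe only tests TCP reachability, not the image pull itself.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout.

    DNS failures, refused connections, and timeouts all count as unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # socket.gaierror and timeout errors are subclasses of OSError.
        return False
```

Calling can_reach("arosvc.azurecr.io") from a host in the affected subnet gives a rough signal. On an already provisioned node where dnsmasq is running, the name resolves to the RP Gateway private link IP, so a successful result there does not prove that a direct Internet route exists.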

Diagnostic Steps

In either case, traffic to the ARO Azure Container Registry (ACR) at arosvc.azurecr.io:443 is effectively blocked.

  • 4.14.z / 4.13.40 Cluster installs fail with InternalServerError:

    • Private cluster with UDR specified and the subnet route tables lack an Internet route
    • To check if UDR is enabled on the cluster
      • az aro show returns UserDefinedRouting in NetworkProfile > OutboundType
    • To check the Azure route table
      • Azure Portal > Vnet > Subnet > Route tables > routes > Route to the internet has been overridden, or there’s no route to allow arosvc.azurecr.io.
    • When a Virtual Appliance or Firewall is used that blocks outbound Internet traffic or blocks traffic to arosvc.azurecr.io:443.
    • Reviewing virtual appliance/firewall rules is vendor-dependent and outside the scope of this article
  • Clusters upgraded to 4.14.z / 4.13.40 fail to provision new machines:

    • Private cluster with UDR specified and the subnet route tables lack an Internet route
    • To check if UDR is enabled on the cluster
      • az aro show returns UserDefinedRouting in NetworkProfile > OutboundType
    • To check the Azure route table
      • Azure Portal > Vnet > Subnet > Route tables > routes > Route to the internet has been overridden, or there’s no route to allow arosvc.azurecr.io.
    • When a Virtual Appliance or Firewall is used that blocks outbound Internet traffic or blocks traffic to arosvc.azurecr.io:443.
    • Reviewing virtual appliance/firewall rules is vendor-dependent and outside the scope of this article
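
The UDR and route table checks above can also be scripted against JSON saved from the Azure CLI. The sketch below assumes the output of az aro show exposes networkProfile.outboundType and that az network route-table route list returns objects with addressPrefix and nextHopType fields. Those field names are assumptions about the CLI output and should be confirmed against a real cluster.

```python
from typing import Any, Dict, List


def uses_udr(cluster: Dict[str, Any]) -> bool:
    """Return True if the cluster JSON reports UserDefinedRouting as outbound type."""
    profile = cluster.get("networkProfile") or {}
    return str(profile.get("outboundType", "")).lower() == "userdefinedrouting"


def internet_prefixes(routes: List[Dict[str, Any]]) -> List[str]:
    """List the address prefixes whose next hop type is Internet."""
    return [
        r.get("addressPrefix", "")
        for r in routes
        if r.get("nextHopType") == "Internet"
    ]


def has_internet_default_route(routes: List[Dict[str, Any]]) -> bool:
    """Return True if 0.0.0.0/0 is routed to the Internet (not overridden)."""
    return "0.0.0.0/0" in internet_prefixes(routes)
```

A cluster where uses_udr is True and has_internet_default_route is False matches the first condition in the list above. The check cannot tell whether a narrower Internet route covers arosvc.azurecr.io, so the prefixes returned by internet_prefixes still need a manual review.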

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
