curling Azure loadbalancer health probes fails & changing health probe config fails

Solution In Progress - Updated 2024-06-13T21:44:23+00:00

Environment

Azure Red Hat OpenShift (ARO) 4.x

Issue

The health probe settings cannot be edited in the Azure console.
Curling the health probe endpoint (IP:port/healthz) as configured in the Azure console returns a 503 error:

# curl -vvv http://10.20.0.7:12345/healthz
*   Trying 10.20.0.7...
* TCP_NODELAY set
* Connected to 10.20.0.7 (10.20.0.7) port 12345 (#0)
> GET /healthz HTTP/1.1
> Host: 10.20.0.7:12345
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 503 Service Unavailable
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< Date: Tue, 15 Feb 2022 08:39:30 GMT
< Content-Length: 105
< 
* Connection #0 to host 10.20.0.7 left intact

Resolution

By default, the ingress controller runs two replicas of the router pod on two worker nodes. These, and only these, worker nodes respond successfully to a curl on the /healthz path. If you have five workers, for example, three will return a 503 error and the remaining two will succeed. A 503 from a worker that hosts no router pod is therefore expected and does not indicate a fault.

To find the worker nodes on which the router pods are running, run:

  $ oc get -n openshift-ingress pods -o wide

The NODE column of the output shows which workers host a router pod. Look up the IP addresses of those nodes (for example with oc get nodes -o wide) and curl them on the healthCheckNodePort. Note that this port is a random value generated during installation of the cluster.

The curl syntax should thus be:

  $ curl <nodeIP where router pod is>:<healthCheckNodePort>/healthz
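The two steps above can be combined into a small helper. This is only a sketch under a few assumptions: the router service has the default name router-default, every pod in the openshift-ingress namespace is a router pod, and the machine running it can reach the node IPs directly. The function name check_router_nodes is made up for illustration.

```shell
# Sketch: curl /healthz on every node that currently hosts a router pod.
# Assumption: the service is the default "router-default" in openshift-ingress.
check_router_nodes() {
  # Read the health check port from the router service spec.
  port=$(oc get svc router-default -n openshift-ingress \
    -o jsonpath='{.spec.healthCheckNodePort}')

  # .status.hostIP is the IP of the node each pod is scheduled on.
  # sort -u drops duplicates if two pods share a node.
  for ip in $(oc get pods -n openshift-ingress \
      -o jsonpath='{.items[*].status.hostIP}' | tr ' ' '\n' | sort -u); do
    # Print only the HTTP status code; the response body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
      "http://${ip}:${port}/healthz")
    echo "${ip}:${port} -> HTTP ${code}"
  done
}
```

Run check_router_nodes after logging in with oc. Each listed node should report HTTP 200 as long as its router pod is ready.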

Worker node replacement and security

If one or more of the worker nodes where the router pods were running are replaced, the ingress controller automatically updates the Azure health probe with the new IP address. This is one of the reasons the security of that configuration is so restrictive: you should never need to change it manually, and any manual change is liable to be overwritten by the ingress controller anyway.

Root Cause

There are three health probes set up by the ARO installation:

1 x API server health probe on port 6443
2 x router service health probes on a random port
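If you want to view these probes without opening the Azure console, the Azure CLI can list them read-only. The following is a minimal sketch: the function name list_lb_probes is hypothetical, and the resource group and load balancer names are placeholders that you must supply for your own cluster. It only reads the configuration and changes nothing.

```shell
# Sketch: list the health probes of a load balancer (read-only).
# Arguments are placeholders: <cluster resource group> <load balancer name>.
list_lb_probes() {
  if [ "$#" -ne 2 ]; then
    echo "usage: list_lb_probes <resource-group> <lb-name>" >&2
    return 1
  fi
  az network lb probe list \
    --resource-group "$1" \
    --lb-name "$2" \
    --output table
}
```

For example: list_lb_probes <cluster resource group> <load balancer name>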

In Azure generally, the health probes of a Load Balancer, configured alongside its load balancing rules, determine which resources in the backend pool are healthy and ready to receive traffic.

For instance, on a public cluster, the public load balancer has a load balancing rule for the API server (6443:6443) with all the VMs in the resource group in its backend pool. The health probe on port 6443 tells the Load Balancer which resources are ready to take this traffic: in this case, only the master nodes that are up and running with API server pods up and ready. The worker nodes, although part of the backend pool, do not receive API server traffic because they do not respond to this health probe.

To check this further, retrieve the router service objects in the openshift-ingress namespace. The spec of each object defines a healthCheckNodePort value, which is the same port that appears in the Load Balancer health probes for the routers. The OpenShift documentation says of it: "External systems (e.g. load-balancers) can use this port to determine if a given node holds endpoints for this service or not." When not specified, this port value is picked at random from the cluster's NodePort range.
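The relevant fields can be read straight from the service spec. The sketch below assumes the default service name router-default, and the function name show_router_probe_settings is hypothetical. It also prints externalTrafficPolicy, because in Kubernetes healthCheckNodePort is only used by LoadBalancer services whose externalTrafficPolicy is Local.

```shell
# Sketch: print the health check settings of a router service.
# Assumption: the service name defaults to "router-default".
show_router_probe_settings() {
  svc="${1:-router-default}"
  policy=$(oc get svc "$svc" -n openshift-ingress \
    -o jsonpath='{.spec.externalTrafficPolicy}')
  port=$(oc get svc "$svc" -n openshift-ingress \
    -o jsonpath='{.spec.healthCheckNodePort}')
  echo "service=${svc} externalTrafficPolicy=${policy} healthCheckNodePort=${port}"
}
```

The port printed here should match the port shown on the router health probes in the Azure console.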

As specified in the ARO Service Definition, customers are not supposed to change resources in the Cluster Resource Group directly. Any change to the cluster configuration must be made through the OpenShift API.

For more on networking in ARO, see "Concepts - Networking diagram for Azure Red Hat on OpenShift 4" on Microsoft Docs.

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

Diagnostic Steps

Curl /healthz on the worker nodes using the health probe port and observe the output.
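One way to run this check across all workers at once is sketched below. It assumes the workers carry the node-role.kubernetes.io/worker label and that the node IPs are reachable from where you run it. The function names explain_status and probe_all_workers are made up for illustration, and the explanations attached to each status code reflect the behavior described in this article rather than any official mapping.

```shell
# Map the HTTP status returned by /healthz to a short explanation.
explain_status() {
  case "$1" in
    200) echo "ready router pod on this node" ;;
    503) echo "no ready router pod on this node" ;;
    000) echo "no HTTP response (timeout or connection refused)" ;;
    *)   echo "unexpected status" ;;
  esac
}

# Sketch: curl /healthz on every worker node.
# Argument: the healthCheckNodePort of the router service.
probe_all_workers() {
  port="$1"
  if [ -z "$port" ]; then
    echo "usage: probe_all_workers <healthCheckNodePort>" >&2
    return 1
  fi
  for ip in $(oc get nodes -l node-role.kubernetes.io/worker \
      -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
    # curl reports 000 as the status code when it gets no HTTP response.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
      "http://${ip}:${port}/healthz")
    echo "${ip} ${code} $(explain_status "$code")"
  done
}
```

With the default two router replicas, you would expect two workers to report 200 and the rest to report 503.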

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
