Curling Azure load balancer health probes fails and the health probe configuration cannot be changed
Environment
- Azure Red Hat OpenShift (ARO) 4.x
Issue
- Cannot edit the health probe settings in the Azure console
- Curling the health probe endpoint (IP:port/healthz) as configured in the Azure console returns a 503 error:
# curl -vvv http://10.20.0.7:12345/healthz
* Trying 10.20.0.7...
* TCP_NODELAY set
* Connected to 10.20.0.7 (10.20.0.7) port 12345 (#0)
> GET /healthz HTTP/1.1
> Host: 10.20.0.7:12345
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< Date: Tue, 15 Feb 2022 08:39:30 GMT
< Content-Length: 105
<
* Connection #0 to host 10.20.0.7 left intact
Resolution
The ingress controller runs two replicas of the router pod, each on a different worker node. Only these worker nodes respond successfully to a curl on the /healthz path. If you have five workers, for example, three will return a 503 error and the remaining two will succeed.
To find the worker nodes on which the routers are running, you can do the following:
$ oc get -n openshift-ingress pods -o wide
The NODE column in the output of the above command shows which workers host the router pods. You should then be able to curl those worker IPs on the healthCheckNodePort. Note that this port is a random value generated during installation of the cluster.
The curl syntax should thus be:
$ curl <nodeIP where router pod is>:<healthCheckNodePort>/healthz
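The lookup described above can be scripted. The following is a minimal sketch, not an official procedure: it assumes the default ingress controller, whose LoadBalancer service is usually named router-default, and the function name router_health_urls is made up for illustration. It prints one /healthz URL per node that currently hosts a router pod.

```shell
# Sketch: print the /healthz URL for every node that hosts a router pod.
# Assumption: the router service is named "router-default" (the usual name
# for the default ingress controller); adjust if your cluster differs.
router_health_urls() {
  local port ip
  # healthCheckNodePort is defined in the spec of the router service.
  port=$(oc get svc router-default -n openshift-ingress \
    -o jsonpath='{.spec.healthCheckNodePort}')
  # status.hostIP is the IP of the node each router pod is scheduled on.
  for ip in $(oc get pods -n openshift-ingress \
    -o jsonpath='{.items[*].status.hostIP}'); do
    echo "http://${ip}:${port}/healthz"
  done
}
```

After defining the function, run router_health_urls and pass each printed URL to curl from a host that can reach the node network.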
Worker node replacement and security
If one or more of the worker nodes running the router pods is replaced, the ingress controller automatically updates the Azure health probe with the new IP address. This is one reason the security of that configuration is so restrictive: you should never need to change it manually, and any manual change is liable to be overwritten by the ingress controller anyway.
Root Cause
There are three health probes set up by the ARO installation:
- One API server health probe on port 6443
- Two router service health probes on a random port
In Azure generally, load balancer health probes, configured along with load balancing rules, are used to determine which resources in the backend pool are healthy and ready to receive traffic.
For instance, in a public cluster, the public load balancer has a load balancing rule for the API server (6443:6443) with all the VMs in the resource group in the backend pool. The health probe on port 6443 lets the load balancer know which resources are ready to take this traffic: in this case, only the master nodes that are up and running with API server pods up and ready. In particular, the worker nodes, although part of the backend pool, will not receive traffic for the API server because they do not respond to this health probe.
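To see the probes described above without opening the Azure console, a read-only listing with the Azure CLI is one option. This is a sketch under assumptions: the resource group and load balancer names are placeholders you must supply, the function name list_lb_probes is illustrative, and the command only reads the configuration and does not change it.

```shell
# Sketch: list the health probes of one Azure load balancer (read-only).
# Both arguments are placeholders supplied by the caller:
#   $1 = cluster resource group name, $2 = load balancer name.
list_lb_probes() {
  local resource_group="$1" lb_name="$2"
  az network lb probe list \
    --resource-group "$resource_group" \
    --lb-name "$lb_name" \
    --query '[].{name:name, protocol:protocol, port:port, path:requestPath}' \
    --output table
}
```

The port column of the router probes should match the healthCheckNodePort of the router service.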
To check further on this, retrieve the router service objects in the openshift-ingress namespace. The spec of the object defines a healthCheckNodePort value, which is the same port value shown in the load balancer health probes for the routers. According to the OpenShift documentation, "External systems (e.g. load-balancers) can use this port to determine if a given node holds endpoints for this service or not." When not specified, this port value is picked at random from the cluster's NodePort range.
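As a sketch of the check just described, the command below prints the relevant spec fields of every service in the openshift-ingress namespace. The function name and column headers are illustrative choices. In Kubernetes, healthCheckNodePort is only allocated for LoadBalancer services whose externalTrafficPolicy is Local, which is why only nodes hosting a router pod answer the probe successfully; services without the field show <none>.

```shell
# Sketch: show type, traffic policy and health check port of the services
# in the openshift-ingress namespace. Column headers are arbitrary labels.
show_router_probe_ports() {
  oc get svc -n openshift-ingress \
    -o custom-columns='NAME:.metadata.name,TYPE:.spec.type,TRAFFIC_POLICY:.spec.externalTrafficPolicy,HEALTH_PORT:.spec.healthCheckNodePort'
}
```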
As specified in the ARO Service Definition, resources in the Cluster Resource Group are not supposed to be changed by customers directly. Any change to the cluster configuration must be made through the OpenShift API.
Please also refer to Concepts - Networking diagram for Azure Red Hat on OpenShift 4 | Microsoft Docs for further information on networking in ARO.
Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.
Diagnostic Steps
Curl /healthz on each worker node using the health probe port and observe the output. Nodes that host a router pod should succeed, while the others return a 503 error.
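This diagnostic step can be sketched as a small loop. The function name probe_workers is made up for illustration, the health probe port is passed as an argument, and the script assumes it runs from a host that can reach the node IPs (for example, a machine inside the cluster's virtual network).

```shell
# Sketch: curl /healthz on every worker node and print "<nodeIP> <HTTP status>".
#   $1 = health probe port (the healthCheckNodePort of the router service).
probe_workers() {
  local port="$1" ip code
  # Collect the InternalIP address of every node labelled as a worker.
  for ip in $(oc get nodes -l node-role.kubernetes.io/worker \
    -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
    # Discard the response body and keep only the HTTP status code.
    code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' \
      "http://${ip}:${port}/healthz")
    echo "${ip} ${code}"
  done
}
```

Call it as probe_workers <healthCheckNodePort>, replacing the placeholder with the port configured in the health probe.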
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.