Unable to access ARO cluster as the api-server was down


Environment

  • Azure Red Hat OpenShift (ARO)
    • 4.x

Issue

  • The cluster console and api-server were inaccessible because the api-server was down.
  • Error logs seen in the api-server pods:
2021/08/17 12:54:08 [ERROR] healthcheck has failed fatal=true err=Ran into error while performing 'GET' request: Get "https://localhost:6443/readyz": dial tcp [::1]:6443: connect: connection refused check=dependency-check
E0817 12:54:08.360714       1 leaderelection.go:321] error retrieving resource lock openshift-service-ca-operator/service-ca-operator-lock: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-service-ca-operator/configmaps/service-ca-operator-lock?timeout=35s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0817 12:54:09.483564       1 leaderelection.go:321] error retrieving resource lock openshift-kube-apiserver/cert-regeneration-controller-lock: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-apiserver/configmaps/cert-regeneration-controller-lock?timeout=35s": dial tcp [::1]:6443: connect: connection refused
2021/08/17 12:54:10 [ERROR] healthcheck has failed check=dependency-check fatal=true err=Ran into error while performing 'GET' request: Get "https://localhost:6443/readyz": dial tcp [::1]:6443: connect: connection refused
2021-08-17 12:54:10.333575 W | etcdserver/api/etcdhttp: /health error; no leader (status code 503)
E0817 12:54:11.764468       1 leaderelection.go:321] error retrieving resource lock openshift-kube-scheduler/cert-recovery-controller-lock: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-scheduler/configmaps/cert-recovery-controller-lock?timeout=35s": dial tcp [::1]:6443: connect: connection refused
2021/08/17 12:54:12 [ERROR] healthcheck has failed err=Ran into error while performing 'GET' request: Get "https://localhost:6443/readyz": dial tcp [::1]:6443: connect: connection refused check=dependency-check fatal=true
  • Error logs seen in the etcd pods:
2021-08-17 12:54:14.288356 W | etcdserver: read-only range request "key:\"/openshift.io/brokertemplateinstances/\" range_end:\"/openshift.io/brokertemplateinstances0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (18.998453228s) to execute
2021-08-17 12:54:14.288398 W | etcdserver: read-only range request "key:\"/openshift.io/rangeallocations/\" range_end:\"/openshift.io/rangeallocations0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (18.99853743s) to execute
2021-08-17 12:54:14.288455 W | etcdserver: read-only range request "key:\"/openshift.io/oauth/clientauthorizations/\" range_end:\"/openshift.io/oauth/clientauthorizations0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (18.99850363s)
  • A master node was also down. On one of the master nodes, there were OOM kills of the mdsd process:
Out of memory: Killed process 3807193 (mdsd) total-vm:1358184kB, anon-rss:807540kB, file-rss:0kB, shmem-rss:0kB, UID:0
[1166713.898518] oom_reaper: reaped process 3807193 (mdsd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  • After the master node was redeployed, its console logs still showed issues with mdsd and the Journal Service starting:
[1166713.429170] Out of memory: Killed process 3807193 (mdsd) total-vm:1358184kB, anon-rss:807540kB, file-rss:0kB, shmem-rss:0kB, UID:0

Resolution

  • After the master node was redeployed, the Machine API still reported it in a Failed state. This is a known bug; upgrade the cluster to the latest 4.7 release. Refer to BZ#1882169.
  • Remove the application workload that interferes with the functioning of the OpenShift cluster operators.
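For example, if the interfering workload registered admission webhooks, they can be located and removed as follows. This is a sketch only: the webhook configuration name below is hypothetical, and you must first identify the resources actually created by the workload before deleting anything.

```shell
# List admission webhook configurations to find those created by the
# customer workload (rather than by OpenShift itself)
oc get mutatingwebhookconfigurations
oc get validatingwebhookconfigurations

# Remove the interfering webhook configuration.
# "consul-connect-injector" is an example name; substitute the one
# actually deployed by the workload.
oc delete mutatingwebhookconfiguration consul-connect-injector
```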

Root Cause

  • Cluster operators were failing due to a customer workload (consul). The workload deployed mutating/validating webhook configurations, and its init pods were failing because of the unavailability of its DaemonSet. The pods were not ready because they attempted to create too many pthreads within a single container, leaving them in a failed state.
  • Jobs (customer workload) in the cluster were creating ReplicaSets over and over in a loop, which caused the etcd database size to grow over time. This put pressure on the master nodes and eventually caused the cluster to go down.
  • Per the responsibility matrix, ARO SREs are not responsible for end-user workloads, so support for such issues is provided on a best-effort basis.
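The etcd database growth described above can be confirmed from an etcd pod. This is a sketch: the pod name below is an example, so first list the pods in the openshift-etcd namespace to find the actual names.

```shell
# Find the etcd pods on this cluster
oc -n openshift-etcd get pods -l app=etcd

# Show endpoint status for the etcd cluster, including the DB size column
# ("etcd-master-0" is an example pod name; use one from the listing above)
oc -n openshift-etcd exec etcd-master-0 -c etcdctl -- \
  etcdctl endpoint status --cluster -w table
```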

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

Diagnostic Steps

  • Check the cluster operator and node status.
$ oc get co <cluster_operator>
$ oc get nodes
  • Reboot the master node.
  • If that doesn't bring up the master node, redeploy the master node.
  • Check the master node logs.
$ oc adm node-logs <master_node>
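Given the root cause above, the following checks can help confirm whether a customer workload is overloading the cluster. This is a sketch under the assumptions of this case (looping Jobs/ReplicaSets and workload-deployed webhooks); adapt it to the workloads actually present.

```shell
# Overall cluster operator health
oc get co

# Look for Jobs that may be creating objects in a loop, and gauge how many
# ReplicaSets have accumulated cluster-wide
oc get jobs --all-namespaces
oc get replicasets --all-namespaces | wc -l

# Inspect admission webhook configurations deployed by application workloads
oc get mutatingwebhookconfigurations
oc get validatingwebhookconfigurations
```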

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
