Unable to access ARO cluster as the api-server was down


Environment

  • Azure Red Hat OpenShift (ARO)
    • 4.x

Issue

  • The cluster console and api-server were inaccessible because the api-server was down.
  • Error logs seen in the api-server pods:
2021/08/17 12:54:08 [ERROR] healthcheck has failed fatal=true err=Ran into error while performing 'GET' request: Get "https://localhost:6443/readyz": dial tcp [::1]:6443: connect: connection refused check=dependency-check
E0817 12:54:08.360714       1 leaderelection.go:321] error retrieving resource lock openshift-service-ca-operator/service-ca-operator-lock: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-service-ca-operator/configmaps/service-ca-operator-lock?timeout=35s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0817 12:54:09.483564       1 leaderelection.go:321] error retrieving resource lock openshift-kube-apiserver/cert-regeneration-controller-lock: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-apiserver/configmaps/cert-regeneration-controller-lock?timeout=35s": dial tcp [::1]:6443: connect: connection refused
2021/08/17 12:54:10 [ERROR] healthcheck has failed check=dependency-check fatal=true err=Ran into error while performing 'GET' request: Get "https://localhost:6443/readyz": dial tcp [::1]:6443: connect: connection refused
2021-08-17 12:54:10.333575 W | etcdserver/api/etcdhttp: /health error; no leader (status code 503)
E0817 12:54:11.764468       1 leaderelection.go:321] error retrieving resource lock openshift-kube-scheduler/cert-recovery-controller-lock: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-scheduler/configmaps/cert-recovery-controller-lock?timeout=35s": dial tcp [::1]:6443: connect: connection refused
2021/08/17 12:54:12 [ERROR] healthcheck has failed err=Ran into error while performing 'GET' request: Get "https://localhost:6443/readyz": dial tcp [::1]:6443: connect: connection refused check=dependency-check fatal=true
  • Error logs seen in the etcd pods:
2021-08-17 12:54:14.288356 W | etcdserver: read-only range request "key:\"/openshift.io/brokertemplateinstances/\" range_end:\"/openshift.io/brokertemplateinstances0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (18.998453228s) to execute
2021-08-17 12:54:14.288398 W | etcdserver: read-only range request "key:\"/openshift.io/rangeallocations/\" range_end:\"/openshift.io/rangeallocations0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (18.99853743s) to execute
2021-08-17 12:54:14.288455 W | etcdserver: read-only range request "key:\"/openshift.io/oauth/clientauthorizations/\" range_end:\"/openshift.io/oauth/clientauthorizations0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (18.99850363s)
  • A master node was also down. On one of the master nodes, there were OOM kills of the mdsd process:
Out of memory: Killed process 3807193 (mdsd) total-vm:1358184kB, anon-rss:807540kB, file-rss:0kB, shmem-rss:0kB, UID:0
[1166713.898518] oom_reaper: reaped process 3807193 (mdsd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  • After the master node was redeployed, its console logs still showed issues with mdsd and the Journal Service starting:
[1166713.429170] Out of memory: Killed process 3807193 (mdsd) total-vm:1358184kB, anon-rss:807540kB, file-rss:0kB, shmem-rss:0kB, UID:0

Resolution

  • After the master node was redeployed, the Machine API still reported it in a Failed state. This is a known bug; upgrade the cluster to the latest 4.7 release. Refer to BZ#1882169.
  • Remove the application workload that interferes with the functioning of the OpenShift cluster operators.
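For example, if the interfering workload registered admission webhooks, they can be located and removed as follows. This is a sketch only: the webhook configuration name below is hypothetical, and you must first identify the resources actually created by the workload before deleting anything.

```shell
# List admission webhook configurations to find those created by the
# customer workload (rather than by OpenShift itself)
oc get mutatingwebhookconfigurations
oc get validatingwebhookconfigurations

# Remove the interfering webhook configuration.
# "consul-connect-injector" is an example name; substitute the one
# actually deployed by the workload.
oc delete mutatingwebhookconfiguration consul-connect-injector
```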

Root Cause

  • Cluster operators were failing due to a customer workload (consul). The workload deployed mutating/validating webhook configurations, and its init pods were failing because of the unavailability of its DaemonSet. The pods were not ready because they attempted to create too many pthreads within a single container, leaving them in a failed state.
  • Jobs (customer workload) in the cluster were creating ReplicaSets over and over in a loop, which caused the etcd database size to grow over time. This put pressure on the master nodes and eventually caused the cluster to go down.
  • Per the responsibility matrix, ARO SREs are not responsible for end-user workloads, so support for such issues is provided on a best-effort basis.
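The etcd database growth described above can be confirmed from an etcd pod. This is a sketch: the pod name below is an example, so first list the pods in the openshift-etcd namespace to find the actual names.

```shell
# Find the etcd pods on this cluster
oc -n openshift-etcd get pods -l app=etcd

# Show endpoint status for the etcd cluster, including the DB size column
# ("etcd-master-0" is an example pod name; use one from the listing above)
oc -n openshift-etcd exec etcd-master-0 -c etcdctl -- \
  etcdctl endpoint status --cluster -w table
```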

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

Diagnostic Steps

  • Check the cluster operator and node status.
$ oc get co <cluster_operator>
$ oc get nodes
  • Reboot the master node.
  • If that doesn't bring up the master node, redeploy the master node.
  • Check the master node logs.
$ oc adm node-logs <master_node>
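Given the root cause above, the following checks can help confirm whether a customer workload is overloading the cluster. This is a sketch under the assumptions of this case (looping Jobs/ReplicaSets and workload-deployed webhooks); adapt it to the workloads actually present.

```shell
# Overall cluster operator health
oc get co

# Look for Jobs that may be creating objects in a loop, and gauge how many
# ReplicaSets have accumulated cluster-wide
oc get jobs --all-namespaces
oc get replicasets --all-namespaces | wc -l

# Inspect admission webhook configurations deployed by application workloads
oc get mutatingwebhookconfigurations
oc get validatingwebhookconfigurations
```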

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
