Unable to access ARO cluster as the api-server was down
Environment
- Azure Red Hat OpenShift (ARO)
- 4.x
Issue
- Cluster console and api-server were inaccessible as the api-server was down.
- Error logs seen in the api-server pods:
2021/08/17 12:54:08 [ERROR] healthcheck has failed fatal=true err=Ran into error while performing 'GET' request: Get "https://localhost:6443/readyz": dial tcp [::1]:6443: connect: connection refused check=dependency-check
E0817 12:54:08.360714 1 leaderelection.go:321] error retrieving resource lock openshift-service-ca-operator/service-ca-operator-lock: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-service-ca-operator/configmaps/service-ca-operator-lock?timeout=35s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0817 12:54:09.483564 1 leaderelection.go:321] error retrieving resource lock openshift-kube-apiserver/cert-regeneration-controller-lock: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-apiserver/configmaps/cert-regeneration-controller-lock?timeout=35s": dial tcp [::1]:6443: connect: connection refused
2021/08/17 12:54:10 [ERROR] healthcheck has failed check=dependency-check fatal=true err=Ran into error while performing 'GET' request: Get "https://localhost:6443/readyz": dial tcp [::1]:6443: connect: connection refused
2021-08-17 12:54:10.333575 W | etcdserver/api/etcdhttp: /health error; no leader (status code 503)
E0817 12:54:11.764468 1 leaderelection.go:321] error retrieving resource lock openshift-kube-scheduler/cert-recovery-controller-lock: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-scheduler/configmaps/cert-recovery-controller-lock?timeout=35s": dial tcp [::1]:6443: connect: connection refused
2021/08/17 12:54:12 [ERROR] healthcheck has failed err=Ran into error while performing 'GET' request: Get "https://localhost:6443/readyz": dial tcp [::1]:6443: connect: connection refused check=dependency-check fatal=true
- Error logs seen in the etcd pods:
2021-08-17 12:54:14.288356 W | etcdserver: read-only range request "key:\"/openshift.io/brokertemplateinstances/\" range_end:\"/openshift.io/brokertemplateinstances0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (18.998453228s) to execute
2021-08-17 12:54:14.288398 W | etcdserver: read-only range request "key:\"/openshift.io/rangeallocations/\" range_end:\"/openshift.io/rangeallocations0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (18.99853743s) to execute
2021-08-17 12:54:14.288455 W | etcdserver: read-only range request "key:\"/openshift.io/oauth/clientauthorizations/\" range_end:\"/openshift.io/oauth/clientauthorizations0\" limit:10000 " with result "error:etcdserver: request timed out" took too long (18.99850363s)
- A master node was also down. On one of the master nodes, there were OOM (out-of-memory) errors for the mdsd process:
Out of memory: Killed process 3807193 (mdsd) total-vm:1358184kB, anon-rss:807540kB, file-rss:0kB, shmem-rss:0kB, UID:0
[1166713.898518] oom_reaper: reaped process 3807193 (mdsd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
- The master node was redeployed; however, the console logs still showed mdsd and the Journal Service failing to start:
[1166713.429170] Out of memory: Killed process 3807193 (mdsd) total-vm:1358184kB, anon-rss:807540kB, file-rss:0kB, shmem-rss:0kB, UID:0
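OOM kills like the ones above can be confirmed from the kernel logs of the affected node via a debug pod. A minimal sketch, assuming a node named master-0 (substitute the actual node name):

```shell
# Start a debug pod on the affected master node and search the kernel
# journal for OOM events (node name "master-0" is an example)
oc debug node/master-0 -- chroot /host \
    sh -c "journalctl -k | grep -i -E 'out of memory|oom_reaper'"
```

Any matching lines identify which process the kernel killed and how much anonymous memory it held at the time.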
Resolution
- After redeploying the master node, the Machine API still considered the master node to be in a FAILED state. This is a known bug, and the cluster needs to be upgraded to the latest v4.7. Refer to BZ#1882169.
- Remove the application workload that interferes with the functioning of the OpenShift cluster operators.
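The workload's interfering webhook configurations can be located and removed with standard oc commands. A sketch; the webhook name shown is only an illustrative example of what a consul deployment might register, so verify the actual names in your cluster before deleting anything:

```shell
# List all admission webhook configurations to identify the workload's entries
oc get mutatingwebhookconfigurations
oc get validatingwebhookconfigurations

# Delete the interfering entry once identified (name below is hypothetical)
oc delete mutatingwebhookconfiguration <workload_webhook_name>
```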
Root Cause
- Cluster operators were failing due to a customer workload (consul). It deployed mutating/validating webhookconfigurations, and its init pods were failing due to the unavailability of its daemonset. The pods were not ready because the workload was attempting to create too many pthreads within a single container, leaving them in a failed state.
- Jobs (customer workload) created in the cluster were creating replicasets over and over again in a loop, which caused the etcd DB size to grow over time. This put pressure on the master nodes and eventually caused the cluster to go down. Per the responsibility matrix, ARO SREs are not responsible for end-user workloads; this results in a best-effort support policy.
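etcd database growth of this kind can be measured directly from one of the etcd pods. A sketch, assuming the standard openshift-etcd namespace; the pod name is an example, so substitute one returned by the first command:

```shell
# List the etcd pods running on the master nodes
oc get pods -n openshift-etcd

# Report per-member database size in a table (pod name is an example)
oc rsh -n openshift-etcd etcd-master-0 \
    etcdctl endpoint status --cluster -w table
```

The DB SIZE column reveals whether the database has grown abnormally, e.g. from objects created in a loop as described above.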
Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.
Diagnostic Steps
- Check the cluster operator and node status.
$ oc get co <cluster_operator>
$ oc get nodes
- Reboot the master node.
- If that doesn't bring up the master node, redeploy the master node.
- Check the master node logs.
$ oc adm node-logs <master_node>
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.