Cluster Autoscaler operator not working properly
Environment
- Red Hat OpenShift Service on AWS (ROSA) 4
Issue
- Cluster autoscaler does not scale up new machines in the cluster.
- Elasticsearch pods are stuck in Pending state in the custom namespace "custom-xxxx" because of a resource issue.
- Need to verify whether the autoscaler is scaling the nodes at all.
Resolution
Possible ways:
- Increase the worker node size so that scaled-out nodes can accommodate the current workload.
- Decrease the requested memory for the Elasticsearch workload (if this does not impact the workloads).
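As an illustration of the second option, a hedged sketch of lowering the memory request on the Elasticsearch custom resource is shown below. This assumes the pods are managed by the ECK (Elastic Cloud on Kubernetes) operator; the resource name, namespace, and values are placeholders and must be adjusted to the actual deployment.

```yaml
# Hypothetical example (assumes the ECK operator manages these pods):
# lower the memory request so the pods fit on the existing node size.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
  namespace: custom-xxxx
spec:
  nodeSets:
  - name: default
    count: 4
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 8Gi   # reduced from 12Gi (placeholder value)
            limits:
              memory: 8Gi
```

Apply the change with `oc apply -f` and confirm the pods leave the Pending state.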
Root Cause
The Cluster Autoscaler would not scale the nodes because the Elasticsearch workload sets a resourceRequest of 12Gi of memory, and the node sizes are too small to accommodate this request.
Diagnostic Steps
- Check the Elasticsearch pods under the custom namespace
$ oc get po -n custom-xxxx
NAME                         READY   STATUS    RESTARTS   AGE
elasticsearch-es-default-0   0/1     Pending   0          62m
elasticsearch-es-default-1   0/1     Pending   0          62m
elasticsearch-es-default-2   0/1     Pending   0          62m
elasticsearch-es-default-3   0/1     Pending   0          62m
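A standard follow-up (not shown in the original output) is to describe one of the Pending pods and check the scheduler events at the end of the output, which state why the pod cannot be placed (for example, insufficient memory on all nodes):

```shell
# Describe a Pending pod; the Events section at the end of the output
# reports the scheduling failure reason.
$ oc describe pod elasticsearch-es-default-0 -n custom-xxxx
```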
- Check the cluster-autoscaler-default pod logs in the openshift-machine-api namespace
$ oc logs cluster-autoscaler-default-xxxxxxxxxx-xxxxx -n openshift-machine-api
I0926 06:18:33.665333 1 klogx.go:86] Pod openshift-ingress/router-rebellion-xxxxxxxxx-xxxxxx is unschedulable
I0926 06:18:33.665349 1 klogx.go:86] Pod openshift-ingress/router-rebellion-xxxxxxxxx-xxxxxx is unschedulable
I0926 06:18:33.665354 1 klogx.go:86] Pod custom-xxxx/elasticsearch-es-default-3 is unschedulable
I0926 06:18:33.665358 1 klogx.go:86] Pod custom-xxxx/elasticsearch-es-default-1 is unschedulable
I0926 06:18:33.665362 1 klogx.go:86] Pod custom-xxxx/elasticsearch-es-default-0 is unschedulable
I0926 06:18:33.665365 1 klogx.go:86] Pod openshift-velero/managed-velero-operator-registry-xxxx is unschedulable
- To verify whether the autoscaler is scaling the nodes at all, you can set the minimum required nodes per group higher than the number of currently running nodes:
$ oc get machineautoscalers.autoscaling.openshift.io -n openshift-machine-api
NAME                                     REF KIND     REF NAME                                 MIN   MAX   AGE
custom-xxxx-xxxx-worker1-eu-central-1a   MachineSet   custom-xxxx-xxxx-worker1-eu-central-1a   1     7     31h
custom-xxxx-xxxx-worker2-eu-central-1b   MachineSet   custom-xxxx-xxxx-worker2-eu-central-1b   1     7     31h
custom-xxxx-xxxx-worker3-eu-central-1c   MachineSet   custom-xxxx-xxxx-worker3-eu-central-1c   1     7     31h
- Currently running machines:
$ oc get machinesets -n openshift-machine-api
NAME                                     DESIRED   CURRENT   READY   AVAILABLE   AGE
custom-xxxx-xxxx-infra1-eu-central-1a    1         1         1       1           489d
custom-xxxx-xxxx-infra2-eu-central-1b    1         1         1       1           489d
custom-xxxx-xxxx-worker1-eu-central-1a   2         2         2       2           489d
custom-xxxx-xxxx-worker2-eu-central-1b   1         1         1       1           489d
custom-xxxx-xxxx-worker3-eu-central-1c   2         2         2       2           489d
- At most 2 machines are currently running per group, so you could set MIN to 3 to see whether this triggers the autoscaler to scale at all (if it does, this confirms that the issue is with the resourceRequest, not the autoscaler).
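Raising the minimum can be done by editing the corresponding MachineAutoscaler resource. A sketch is shown below, using the redacted worker group name from the listing above; revert minReplicas after the test so the autoscaler can scale back down.

```yaml
# Raise minReplicas above the 2 currently running machines to force a scale-up.
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: custom-xxxx-xxxx-worker1-eu-central-1a
  namespace: openshift-machine-api
spec:
  minReplicas: 3   # temporarily raised from 1 for the test
  maxReplicas: 7
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: custom-xxxx-xxxx-worker1-eu-central-1a
```

If a new machine is provisioned after applying this, the autoscaler itself is working and the Pending pods are blocked only by the oversized resourceRequest.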
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.