Cluster Autoscaler not balancing nodes across Availability Zones in OpenShift 4

Solution Verified

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Red Hat OpenShift Service on AWS (ROSA)
    • 4
  • Red Hat OpenShift Dedicated (OSD)
    • 4
  • Cluster Autoscaler

Issue

  • With two Availability Zones, such as west-1a and west-1b, a MachineAutoscaler is configured for the MachineSet in each zone, but the Cluster Autoscaler does not provision worker nodes evenly across the two MachineSets.
  • Nodes are scaled up unevenly across Availability Zones when using the Cluster Autoscaler in OpenShift 4.
  • Is it possible to use the balanceSimilarNodeGroups option in the Cluster Autoscaler in OpenShift 4?

Resolution

Set the balanceSimilarNodeGroups property to true in the ClusterAutoscaler resource, as shown below, to help balance OCP nodes across the different MachineSets:

apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  balanceSimilarNodeGroups: true
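
If a ClusterAutoscaler named default already exists, the same field could also be set with a merge patch instead of editing the full resource. The following is a minimal sketch under that assumption, run from a session logged in with sufficient privileges; the helper name enable_balancing and the PATCH variable are illustrative only and not part of any product tooling.

```shell
#!/usr/bin/env bash
# Merge patch that touches only spec.balanceSimilarNodeGroups.
PATCH='{"spec":{"balanceSimilarNodeGroups":true}}'

# Hypothetical helper: apply the patch to the ClusterAutoscaler named "default".
enable_balancing() {
  oc patch clusterautoscaler default --type merge -p "$PATCH"
}
```

Calling enable_balancing applies the patch. Because it is a merge patch, other fields already present under spec are left unchanged.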

Note: In OSD/ROSA clusters, balanceSimilarNodeGroups in the default ClusterAutoscaler is already set to false (see ROSA-Doc). HIVE-1976 covers allowing customers to configure this setting.

Root Cause

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

The balanceSimilarNodeGroups property enables or disables the --balance-similar-node-groups feature of the Cluster Autoscaler. This feature automatically identifies node groups with the same instance type and the same set of labels, and tries to keep the respective sizes of those node groups balanced.

Note: Currently, balancing is only done at scale-up.
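
Since balancing only applies to node groups that the autoscaler considers similar, it can help to compare the MachineSets side by side. The sketch below lists each MachineSet with its replica count, instance type, and zone, followed by the MachineAutoscaler resources. The providerSpec field paths are an assumption that matches AWS MachineSets; other platforms use different field names. The helper name list_node_groups and the MS_COLUMNS variable are illustrative only.

```shell
#!/usr/bin/env bash
# Columns to print for each MachineSet. The providerSpec paths are an
# assumption valid for AWS; adjust them for other platforms.
MS_COLUMNS='NAME:.metadata.name,REPLICAS:.spec.replicas,TYPE:.spec.template.spec.providerSpec.value.instanceType,ZONE:.spec.template.spec.providerSpec.value.placement.availabilityZone'

# Hypothetical helper: show MachineSets and the MachineAutoscalers targeting them.
list_node_groups() {
  oc get machinesets -n openshift-machine-api -o custom-columns="$MS_COLUMNS"
  oc get machineautoscaler -n openshift-machine-api
}
```

MachineSets that differ in instance type are not expected to be treated as similar, so uneven scaling between them would not be corrected by this option.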

Diagnostic Steps

Check the config of the default ClusterAutoscaler:

$ oc get clusterautoscaler default -o yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
[...]
spec:
  balanceSimilarNodeGroups: true
[...]
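
To see how worker nodes are actually distributed, the nodes can be counted per zone. This is a minimal sketch, assuming the nodes carry the well-known topology.kubernetes.io/zone label and the node-role.kubernetes.io/worker role label; the helper names worker_zones and count_per_zone are illustrative only.

```shell
#!/usr/bin/env bash
# Print one zone name per worker node, read from the
# topology.kubernetes.io/zone node label (dots in the key are escaped
# for jsonpath).
worker_zones() {
  oc get nodes -l node-role.kubernetes.io/worker \
    -o jsonpath='{range .items[*]}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}'
}

# Count how many lines (nodes) fall in each zone.
count_per_zone() {
  sort | uniq -c
}
```

Running worker_zones | count_per_zone prints a count next to each zone name. For a quick visual check without counting, oc get nodes -L topology.kubernetes.io/zone adds the zone as an extra column.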

