When using the Nvidia GPU Operator, more nodes than needed are created by the Cluster Autoscaler


Environment

  • Red Hat OpenShift Container Platform 4.7.
  • NFD 4.6.0-202103010126.p0.
  • Nvidia GPU Operator 1.6.2.

Issue

  • The problem shows up in the following situation:

    1. A pod cannot be scheduled because not enough resources are available.
    2. The Cluster Autoscaler consequently creates a new node, but it takes some time before that node can receive GPU workloads.
  • Then, the following anomalous behaviour occurs:

    1. While the new node is still unable to receive GPU workloads, the pod still cannot be scheduled, so the Cluster Autoscaler creates yet another node.
    2. This repeats in a loop until one of the newly created nodes is ready to receive GPU workloads.

Resolution

  • This is a known bug and Red Hat Engineering is working to solve it: Bug 1943194.

  • Workaround: apply the label cluster-api/accelerator under machineset.spec.template.spec.metadata.labels; the autoscaler then considers those nodes unready until the GPU driver has been deployed. The value of the label must be the same as the value used in the ClusterAutoscaler resource field .spec.resourceLimits.gpus[].type, as shown in the sketch below. Find more details in this comment.
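  • A minimal sketch of the workaround, assuming a hypothetical MachineSet named gpu-worker in the openshift-machine-api namespace and an illustrative GPU type value of nvidia-gpu; substitute the name and value from your own cluster:

        # Sketch only: the MachineSet name and the label value are illustrative, not taken from this article.
        apiVersion: machine.openshift.io/v1beta1
        kind: MachineSet
        metadata:
          name: gpu-worker                    # hypothetical MachineSet backed by GPU instances
          namespace: openshift-machine-api
        spec:
          template:
            spec:
              metadata:
                labels:
                  # Propagated to the Nodes created from this MachineSet; the autoscaler
                  # treats such Nodes as unready until the GPU driver has been deployed.
                  cluster-api/accelerator: nvidia-gpu
              # ...the rest of the MachineSet spec stays unchanged...

    The same label can also be added to an existing MachineSet, for example with oc edit machineset -n openshift-machine-api gpu-worker.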

Root Cause

The Cluster Autoscaler attempts to match the GPU type value supplied through its command-line flags for resource limits with the value of the cluster-api/accelerator label on any Node joining the cluster. When these values match, the Cluster Autoscaler can properly calculate the minimum and maximum counts for that resource type.

On OpenShift, the value supplied in the ClusterAutoscaler resource field .spec.resourceLimits.gpus[].type is the value that is passed to that command-line flag. It must match the value of the cluster-api/accelerator label on the MachineSet that provides the GPU resources, or the Cluster Autoscaler cannot properly calculate the minimums and maximums.
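For reference, a hedged sketch of a matching ClusterAutoscaler resource, using the same illustrative nvidia-gpu type value and example min/max limits (the resource must be named default):

    # Sketch only: the type value and the min/max limits are illustrative.
    apiVersion: autoscaling.openshift.io/v1
    kind: ClusterAutoscaler
    metadata:
      name: default
    spec:
      resourceLimits:
        gpus:
          - type: nvidia-gpu   # must equal the cluster-api/accelerator label value on the GPU MachineSet
            min: 0
            max: 4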

