Chapter 7. Enabling GPU support in OpenShift Data Science

To let your data scientists run compute-heavy workloads, you can enable graphics processing units (GPUs) in OpenShift Data Science. To make GPUs available, you must install the NVIDIA GPU Add-On after you install OpenShift Data Science. This add-on locates and enables any GPU-enabled worker nodes in your cluster, making GPU instance types available for selection. After you have installed the NVIDIA GPU Add-On and ensured that there are GPU-enabled worker nodes in your cluster, your data scientists can select one of the GPU-enabled notebooks in JupyterHub, along with the number of GPUs that they require for their data science work.

Red Hat recommends that you use a separate machine pool for GPU nodes, and that the nodes in this pool have the nvidia.com/gpu NoSchedule taint. If you edit an existing machine pool to add this taint, you must first scale the machine pool down to zero nodes, and then scale it back up to the number of nodes that you require. This ensures that the new taint is applied to all nodes in the machine pool. To ensure consistent behavior across all nodes in the machine pool, Red Hat recommends that you scale the machine pool back up promptly. Because scaling to zero nodes disrupts your deployment, Red Hat recommends that you perform this action as soon as possible, while considering your service usage patterns when selecting an appropriate time.
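After you scale a machine pool back up, one way to confirm that every node in it picked up the taint is to list node taints from the OpenShift CLI. The following is a minimal sketch, assuming that you are logged in with oc and have permission to read nodes. The function names list_node_taints and nodes_with_gpu_taint are hypothetical helpers introduced for illustration; they are not part of oc or OpenShift Data Science.

```shell
# Sketch: print one row per node with the keys and effects of its taints.
# Nodes without taints show <none> in both columns.
list_node_taints() {
  oc get nodes \
    -o custom-columns='NAME:.metadata.name,TAINT_KEYS:.spec.taints[*].key,EFFECTS:.spec.taints[*].effect'
}

# Sketch: print only the names of nodes whose taint keys include nvidia.com/gpu.
# The header row (NR == 1) is skipped; keys arrive as a comma-separated list.
nodes_with_gpu_taint() {
  list_node_taints |
    awk 'NR > 1 && index("," $2 ",", ",nvidia.com/gpu,") { print $1 }'
}
```

Comparing the number of names printed by nodes_with_gpu_taint with the node count of the machine pool shows whether any node is still missing the taint.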

Prerequisites

  • You have credentials for OpenShift Cluster Manager (https://console.redhat.com/openshift/).
  • You are part of the cluster-admins user group in OpenShift Dedicated.
  • You have provisioned a cluster that contains enough resources to satisfy the requirements of OpenShift Data Science and the NVIDIA GPU Add-On.
  • You have installed and logged in to Red Hat OpenShift Data Science.
  • You have installed and logged in to the OpenShift CLI (oc).

Procedure

  1. Navigate to your cluster on OpenShift Cluster Manager.

    1. Log in to OpenShift Cluster Manager (https://console.redhat.com/openshift/).
    2. Click Clusters.

      The Clusters page opens.

    3. Click the name of the cluster on which you installed OpenShift Data Science.

      The Details page for the cluster opens.

  2. Add a machine pool for nodes with GPUs.

    1. Click the Machine pools tab.
    2. Click the Add machine pool button.

      The Add machine pool window opens.

    3. Specify a Machine pool name.
    4. Set a Worker node instance type. Ensure that the instance type provides one or more GPUs.
    5. Set a Worker node count of at least one.
    6. Click Edit node labels and taints to expand the Node labels section.
    7. Under Taints, add a taint with the Key of nvidia.com/gpu and an Effect of NoSchedule. The Value can be set to any string, for example, true.

      Note

      When setting the taint, ensure the taint is correctly declared without typographical errors.

    8. Click Add machine pool.

      Your machine pool is created.

    9. Confirm that the Taint you specified is visible on the Details page for the machine pool, for example, nvidia.com/gpu=true:NoSchedule.
  3. Install the NVIDIA GPU Operator.

    1. Click the Add-ons tab.
    2. Click the NVIDIA GPU Operator card.
    3. Click Install.

Verification

  • In OpenShift Cluster Manager, under the Add-ons tab for the cluster, confirm that the NVIDIA GPU Operator is installed.
  • In the OpenShift Dedicated web console, under Compute → Nodes, confirm that each node in the new machine pool has the nvidia.com/gpu taint set, for example, nvidia.com/gpu=true:NoSchedule.
  • Confirm that the jupyterhub-singleuser-profiles ConfigMap, located in the redhat-ods-applications project on the Workloads → ConfigMaps page, contains the following NoSchedule toleration:

      gpuTypes:
      - type: gpu_one
        node_tolerations:
        - key: provider
          operator: Equal
          value: gpu-node
          effect: NoSchedule
        # This is the default NoSchedule toleration that is supported by the NVIDIA GPU Operator
      - type: nvidia_gpu
        node_tolerations:
        - key: "nvidia.com/gpu"
          operator: Exists
          effect: NoSchedule
  • Check that GPU-enabled functionality is available in Red Hat OpenShift Data Science.

    • Check the nvidia-device-plugin-validator logs. In the OpenShift CLI, enter the following command:

      oc logs nvidia-device-plugin-validator-<alpha-numeric-string> -n redhat-gpu-operator

      Where <alpha-numeric-string> is a randomly generated alpha-numeric string.

      If the validation is successful, the following response is returned:

      device-plugin validation is successful
    • Red Hat recommends that you run a sample GPU application to ensure GPU-enabled models can successfully run on Red Hat OpenShift Data Science. For more information, see Running a sample GPU application.
    • Run the nvidia-smi command within the relevant pod to test the GPU utilization of your sample project. For more information, see Getting information about the GPU.
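
The command-line parts of the checks above can also be scripted. The following is a minimal sketch, assuming that you are logged in with oc and can read the redhat-gpu-operator and redhat-ods-applications projects. The function names are hypothetical helpers introduced for illustration. The ConfigMap check only looks for the nvidia.com/gpu key in the ConfigMap text, so it is a quick screen rather than a full validation of the toleration.

```shell
# Sketch: print the validator pod name(s), in the form pod/<name>,
# so that you do not need to look up the generated suffix by hand.
find_validator_pods() {
  oc get pods -n redhat-gpu-operator -o name |
    grep 'nvidia-device-plugin-validator'
}

# Sketch: succeed only if the log of the given pod contains the
# success message that the validator is documented to return.
validator_succeeded() {
  oc logs "$1" -n redhat-gpu-operator |
    grep -q 'device-plugin validation is successful'
}

# Sketch: succeed if the jupyterhub-singleuser-profiles ConfigMap
# mentions the nvidia.com/gpu key anywhere in its YAML.
configmap_mentions_gpu_key() {
  oc get configmap jupyterhub-singleuser-profiles \
    -n redhat-ods-applications -o yaml |
    grep -qF 'nvidia.com/gpu'
}
```

The names printed by find_validator_pods can be passed directly to validator_succeeded, because oc logs accepts the pod/<name> form.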