Chapter 3. Managing cluster resources

3.1. Configuring the default PVC size for your cluster

To configure how resources are claimed within your OpenShift AI cluster, you can change the default size of the cluster’s persistent volume claim (PVC), ensuring that the storage requested matches your common storage workflow. PVCs are requests for resources in your cluster and also act as claim checks to the resource.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • You are part of the administrator group for OpenShift AI in OpenShift.

Note

Changing the PVC setting restarts the Jupyter pod and makes Jupyter unavailable for up to 30 seconds. To minimize disruption, perform this action outside of your organization’s typical working day.

Procedure

  1. From the OpenShift AI dashboard, click Settings → Cluster settings.
  2. Under PVC size, enter a new size in gibibytes. The minimum size is 1 GiB, and the maximum size is 16384 GiB.
  3. Click Save changes.

Verification

  • New PVCs are created with the default storage size that you configured.
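
The size limits given in step 2 can be checked before a value is entered in the dashboard. The following is a minimal Python sketch and is not part of OpenShift AI: the function name pvc_size_quantity and the two constants are hypothetical, and the "Gi" suffix assumes the Kubernetes quantity notation for gibibytes.

```python
# Limits quoted in the procedure above (in GiB).
MIN_PVC_GIB = 1
MAX_PVC_GIB = 16384


def pvc_size_quantity(size_gib: int) -> str:
    """Validate a PVC size in GiB and return it as a quantity string such as '20Gi'.

    Hypothetical helper for illustration only.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(size_gib, bool) or not isinstance(size_gib, int):
        raise TypeError("PVC size must be a whole number of GiB")
    if not MIN_PVC_GIB <= size_gib <= MAX_PVC_GIB:
        raise ValueError(
            f"PVC size must be between {MIN_PVC_GIB} and {MAX_PVC_GIB} GiB"
        )
    return f"{size_gib}Gi"


if __name__ == "__main__":
    print(pvc_size_quantity(20))
```

Values outside the documented range, such as 0 or 16385, raise ValueError in this sketch.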

3.2. Restoring the default PVC size for your cluster

To change the size of resources utilized within your OpenShift AI cluster, you can restore the default size of your cluster’s persistent volume claim (PVC).

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • You are part of the administrator group for OpenShift AI in OpenShift.

Procedure

  1. From the OpenShift AI dashboard, click Settings → Cluster settings.
  2. Click Restore Default to restore the default PVC size of 20 GiB.
  3. Click Save changes.

Verification

  • New PVCs are created with the default storage size of 20 GiB.

3.3. Overview of accelerators

If you work with large data sets, you can use accelerators to optimize the performance of your data science models in OpenShift AI. With accelerators, you can scale your work, reduce latency, and increase productivity. You can use accelerators in OpenShift AI to assist your data scientists in the following tasks:

  • Natural language processing (NLP)
  • Inference
  • Training deep neural networks
  • Data cleansing and data processing

OpenShift AI supports the following accelerators:

  • NVIDIA graphics processing units (GPUs)

    • To use compute-heavy workloads in your models, you can enable NVIDIA graphics processing units (GPUs) in OpenShift AI.
    • To enable GPUs on OpenShift, you must install the NVIDIA GPU Operator.
  • Habana Gaudi devices (HPUs)

    • Habana, an Intel company, provides hardware accelerators intended for deep learning workloads. You can use the Habana libraries and software associated with Habana Gaudi devices available from your notebook.
    • Before you can enable Habana Gaudi devices in OpenShift AI, you must install the necessary dependencies and the version of the HabanaAI Operator that matches the Habana version of the HabanaAI workbench image in your deployment. For more information about how to enable your OpenShift environment for Habana Gaudi devices, see HabanaAI Operator v1.10 for OpenShift and HabanaAI Operator v1.13 for OpenShift.
    • You can enable Habana Gaudi devices on-premises or with AWS DL1 compute nodes on an AWS instance.

Before you can use an accelerator in OpenShift AI, your OpenShift instance must contain an associated accelerator profile. For accelerators that are new to your deployment, you must configure an accelerator profile for the accelerator in context. You can create an accelerator profile from the Settings → Accelerator profiles page on the OpenShift AI dashboard. If your deployment contains existing accelerators that had associated accelerator profiles already configured, an accelerator profile is automatically created after you upgrade to the latest version of OpenShift AI.

3.3.1. Enabling GPU support in OpenShift AI

Optionally, to ensure that your data scientists can use compute-heavy workloads in their models, you can enable graphics processing units (GPUs) in OpenShift AI.

Important

The NVIDIA GPU add-on is no longer supported. Instead, enable GPUs by installing the NVIDIA GPU Operator. If your deployment has a previously installed NVIDIA GPU add-on, use Red Hat OpenShift Cluster Manager to uninstall the add-on from your cluster before you install the NVIDIA GPU Operator.

Prerequisites

  • You have logged in to your OpenShift cluster.
  • You have the cluster-admin role in your OpenShift cluster.

Procedure

  1. To enable GPU support on an OpenShift cluster, follow the instructions in NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
  2. Delete the migration-gpu-status ConfigMap.

    1. In the OpenShift web console, switch to the Administrator perspective.
    2. Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate ConfigMap.
    3. Search for the migration-gpu-status ConfigMap.
    4. Click the action menu (⋮) and select Delete ConfigMap from the list.

      The Delete ConfigMap dialog appears.

    5. Inspect the dialog and confirm that you are deleting the correct ConfigMap.
    6. Click Delete.
  3. Restart the dashboard replicaset.

    1. In the OpenShift web console, switch to the Administrator perspective.
    2. Click Workloads → Deployments.
    3. Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate deployment.
    4. Search for the rhods-dashboard deployment.
    5. Click the action menu (⋮) and select Restart Rollout from the list.
    6. Wait until the Status column indicates that all pods in the rollout have fully restarted.
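
Steps 2 and 3 above use the web console. As a rough command-line counterpart, the following Python sketch only builds and prints the equivalent oc commands and does not run them. It assumes that the ConfigMap and the rhods-dashboard deployment both live in the redhat-ods-applications project; the function name gpu_migration_commands is hypothetical.

```python
import shlex

# Assumption: both resources are in this project, which the steps above
# name as one of the two places to look.
NAMESPACE = "redhat-ods-applications"


def gpu_migration_commands(namespace: str = NAMESPACE) -> list[list[str]]:
    """Return oc argument lists mirroring steps 2 and 3 (hypothetical helper)."""
    return [
        # Step 2: delete the migration-gpu-status ConfigMap.
        ["oc", "delete", "configmap", "migration-gpu-status", "-n", namespace],
        # Step 3: restart the dashboard deployment rollout.
        ["oc", "rollout", "restart", "deployment/rhods-dashboard", "-n", namespace],
        # Step 3.6: wait until the rollout has finished.
        ["oc", "rollout", "status", "deployment/rhods-dashboard", "-n", namespace],
    ]


if __name__ == "__main__":
    # Dry run: print each command so that it can be reviewed before use.
    for command in gpu_migration_commands():
        print(shlex.join(command))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; review the output, confirm the project name for your deployment, and run the commands yourself while logged in with the cluster-admin role.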

Verification

  • The NVIDIA GPU Operator appears on the Operators → Installed Operators page in the OpenShift web console.
  • The reset migration-gpu-status instance is present in the Instances tab on the AcceleratorProfile custom resource definition (CRD) details page.

After installing the NVIDIA GPU Operator, create an accelerator profile as described in Working with accelerator profiles.

3.3.2. Enabling Habana Gaudi devices

Before you can use Habana Gaudi devices in OpenShift AI, you must install the necessary dependencies and deploy the HabanaAI Operator.

Prerequisites

  • You have logged in to OpenShift.
  • You have the cluster-admin role in OpenShift.

Procedure

  1. To enable Habana Gaudi devices in OpenShift AI, follow the instructions at HabanaAI Operator for OpenShift.
  2. From the OpenShift AI dashboard, click Settings → Accelerator profiles.

    The Accelerator profiles page appears, displaying existing accelerator profiles. To enable or disable an existing accelerator profile, on the row containing the relevant accelerator profile, click the toggle in the Enable column.

  3. Click Create accelerator profile.

    The Create accelerator profile dialog opens.

  4. In the Name field, enter a name for the Habana Gaudi device.
  5. In the Identifier field, enter a unique string that identifies the Habana Gaudi device, for example, habana.ai/gaudi.
  6. Optional: In the Description field, enter a description for the Habana Gaudi device.
  7. To enable or disable the accelerator profile for the Habana Gaudi device immediately after creation, click the toggle in the Enable column.
  8. Optional: Add a toleration to schedule pods with matching taints.

    1. Click Add toleration.

      The Add toleration dialog opens.

    2. From the Operator list, select one of the following options:

      • Equal - The key/value/effect parameters must match. This is the default.
      • Exists - The key/effect parameters must match. Leave the value parameter blank; a blank value matches any value.
    3. From the Effect list, select one of the following options:

      • None
      • NoSchedule - New pods that do not match the taint are not scheduled onto that node. Existing pods on the node remain.
      • PreferNoSchedule - New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to. Existing pods on the node remain.
      • NoExecute - New pods that do not match the taint cannot be scheduled onto that node. Existing pods on the node that do not have a matching toleration are removed.
    4. In the Key field, enter the toleration key habana.ai/gaudi. The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.
    5. In the Value field, enter a toleration value. The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.
    6. In the Toleration Seconds section, select one of the following options to specify how long a pod stays bound to a node that has a node condition.

      • Forever - Pods stay permanently bound to a node.
      • Custom value - Enter a value, in seconds, to define how long pods stay bound to a node that has a node condition.
    7. Click Add.
  9. Click Create accelerator profile.
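
The Equal and Exists operators described in step 8 decide whether a toleration matches a node taint. The following Python sketch illustrates that matching rule under stated assumptions: the Taint and Toleration classes and the tolerates function are hypothetical names, the None option in the Effect list is assumed to correspond to an empty effect that matches any taint effect, and the sketch leaves out other scheduler details such as Toleration Seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str          # NoSchedule, PreferNoSchedule, or NoExecute
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str
    operator: str = "Equal"   # Equal is the default operator
    value: str = ""
    effect: str = ""          # assumption: "" stands for None and matches any effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (simplified sketch)."""
    # A toleration that names an effect only matches taints with that effect.
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        # Exists ignores the taint value entirely.
        return True
    if toleration.operator == "Equal":
        return toleration.value == taint.value
    raise ValueError(f"unknown operator: {toleration.operator!r}")


if __name__ == "__main__":
    gaudi_taint = Taint(key="habana.ai/gaudi", effect="NoSchedule", value="true")
    exists = Toleration(key="habana.ai/gaudi", operator="Exists", effect="NoSchedule")
    wrong_value = Toleration(key="habana.ai/gaudi", value="false", effect="NoSchedule")
    print(tolerates(exists, gaudi_taint))       # Exists: value is not compared
    print(tolerates(wrong_value, gaudi_taint))  # Equal: values differ
```

In this sketch, an Exists toleration is the simpler choice when nodes are tainted with the habana.ai/gaudi key but the taint value varies or is empty.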

Verification

  • From the Administrator perspective, the following Operators appear on the Operators → Installed Operators page:

    • HabanaAI
    • Node Feature Discovery (NFD)
    • Kernel Module Management (KMM)
  • The Accelerator list displays the Habana Gaudi accelerator on the Start a notebook server page. After you select an accelerator, the Number of accelerators field appears, which you can use to choose the number of accelerators for your notebook server.
  • The accelerator profile appears on the Accelerator profiles page.
  • The accelerator profile appears on the Instances tab on the details page for the AcceleratorProfile custom resource definition (CRD).
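
The instance listed on the AcceleratorProfile CRD details page is a custom resource that carries the values entered in the dialog. The following Python sketch shows what such a resource might look like when built as a dictionary. It is illustrative only: the apiVersion, the spec field names (displayName, identifier, description, enabled, tolerations), the namespace, and the helper name accelerator_profile are assumptions, so compare them with the CRD in your cluster before relying on them.

```python
import json


def accelerator_profile(name, display_name, identifier, description="",
                        enabled=True, tolerations=None,
                        namespace="redhat-ods-applications"):
    """Build an AcceleratorProfile-like manifest as a dict (hypothetical helper)."""
    spec = {
        "displayName": display_name,   # Name field in the dialog (assumed mapping)
        "identifier": identifier,      # Identifier field, for example habana.ai/gaudi
        "enabled": enabled,            # Enable toggle
    }
    if description:
        spec["description"] = description
    if tolerations:
        spec["tolerations"] = list(tolerations)
    return {
        "apiVersion": "dashboard.opendatahub.io/v1",  # assumption: verify against the CRD
        "kind": "AcceleratorProfile",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


if __name__ == "__main__":
    profile = accelerator_profile(
        name="habana-gaudi",
        display_name="Habana Gaudi",
        identifier="habana.ai/gaudi",
        tolerations=[{
            "key": "habana.ai/gaudi",
            "operator": "Exists",
            "effect": "NoSchedule",
        }],
    )
    print(json.dumps(profile, indent=2))
```

Comparing this output with the YAML of the instance shown on the Instances tab is a quick way to confirm which fields the dashboard actually wrote.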

3.4. Allocating additional resources to OpenShift AI users

As a cluster administrator, you can allocate additional resources to a cluster to support compute-intensive data science work. This support includes increasing the number of nodes in the cluster and changing the cluster’s allocated machine pool.

Prerequisites

  • You have credentials for administering clusters in OpenShift Cluster Manager (https://console.redhat.com/openshift/). For more information about configuring administrative access in OpenShift Cluster Manager, see Configuring access to clusters in OpenShift Cluster Manager.
  • If you intend to increase the size of a machine pool by using accelerators, you have ensured that your OpenShift cluster supports them.
  • You have an AWS or GCP instance with the capacity to create larger container sizes. For compute-intensive operations, your AWS or GCP instance has enough capacity to accommodate the largest container size, XL.

Procedure

  1. Log in to OpenShift Cluster Manager (https://console.redhat.com/openshift/).
  2. Click Clusters.

    The Clusters page opens.

  3. Click the name of the cluster you want to allocate additional resources to.
  4. Click Actions → Edit node count.
  5. Select a Machine pool from the list.
  6. Select the number of nodes assigned to the machine pool from the Node count list.
  7. Click Apply.

Verification

  • The additional resources that you allocated to the cluster appear on the Machine Pools tab.