Chapter 5. Working with accelerators

Use accelerators, such as NVIDIA GPUs and Habana Gaudi devices, to optimize the performance of your end-to-end data science workflows.

5.1. Overview of accelerators

If you work with large data sets, you can use accelerators to optimize the performance of your data science models in OpenShift AI. With accelerators, you can scale your work, reduce latency, and increase productivity. You can use accelerators in OpenShift AI to assist your data scientists in the following tasks:

  • Natural language processing (NLP)
  • Inference
  • Training deep neural networks
  • Data cleansing and data processing

OpenShift AI supports the following accelerators:

  • NVIDIA graphics processing units (GPUs)

    • To run compute-heavy workloads with your models, you can enable NVIDIA graphics processing units (GPUs) in OpenShift AI.
    • To enable GPUs on OpenShift, you must install the NVIDIA GPU Operator.
  • Habana Gaudi devices (HPUs)

    • Habana, an Intel company, provides hardware accelerators intended for deep learning workloads. You can use the Habana libraries and software associated with Habana Gaudi devices from your notebook.
    • Before you can enable Habana Gaudi devices in OpenShift AI, you must install the necessary dependencies and the version of the HabanaAI Operator that matches the Habana version of the HabanaAI workbench image in your deployment. For more information about how to enable your OpenShift environment for Habana Gaudi devices, see HabanaAI Operator v1.10 for OpenShift and HabanaAI Operator v1.13 for OpenShift.
    • You can enable Habana Gaudi devices on-premises or with AWS DL1 compute nodes on an AWS instance.

Before you can use an accelerator in OpenShift AI, your OpenShift instance must contain an associated accelerator profile. For accelerators that are new to your deployment, you must configure an accelerator profile for the accelerator in context. You can create an accelerator profile from the Settings → Accelerator profiles page on the OpenShift AI dashboard. If your deployment contains existing accelerators that had associated accelerator profiles already configured, an accelerator profile is automatically created after you upgrade to the latest version of OpenShift AI.

5.2. Working with accelerator profiles

To configure accelerators for your data scientists to use in OpenShift AI, you must create an associated accelerator profile. An accelerator profile is a custom resource (CR) on OpenShift, an instance of the AcceleratorProfile custom resource definition (CRD), that defines the specification of the accelerator. You can create and manage accelerator profiles by selecting Settings → Accelerator profiles on the OpenShift AI dashboard.

For accelerators that are new to your deployment, you must manually configure an accelerator profile for each accelerator. If your deployment contains an accelerator before you upgrade, the associated accelerator profile remains after the upgrade. You can manage the accelerators that appear to your data scientists by assigning specific accelerator profiles to your custom notebook images. This example shows the code for a Habana Gaudi 1 accelerator profile:

---
apiVersion: dashboard.opendatahub.io/v1alpha
kind: AcceleratorProfile
metadata:
  name: hpu-profile-first-gen-gaudi
spec:
  displayName: Habana HPU - 1st Gen Gaudi
  description: First Generation Habana Gaudi device
  enabled: true
  identifier: habana.ai/gaudi
  tolerations:
    - effect: NoSchedule
      key: habana.ai/gaudi
      operator: Exists
---

The accelerator profile code appears on the Instances tab on the details page for the AcceleratorProfile custom resource definition (CRD). For more information about accelerator profile attributes, see the following table:

Table 5.1. Accelerator profile attributes

| Attribute | Type | Required | Description |
|---|---|---|---|
| displayName | String | Required | The display name of the accelerator profile. |
| description | String | Optional | Descriptive text defining the accelerator profile. |
| identifier | String | Required | A unique identifier defining the accelerator resource. |
| enabled | Boolean | Required | Determines if the accelerator is visible in OpenShift AI. |
| tolerations | Array | Optional | The tolerations that can apply to notebooks and serving runtimes that use the accelerator. For more information about the toleration attributes that OpenShift AI supports, see Toleration v1 core. |
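
As a further illustration of the attributes in Table 5.1, the following sketch shows what a profile for an NVIDIA GPU might look like. It is an assumption-based example, not a shipped default: the metadata name, display name, and description are hypothetical, nvidia.com/gpu is assumed to be the resource name that your NVIDIA GPU Operator installation advertises, and the toleration is only useful if your GPU nodes carry a matching taint. The apiVersion is copied from the Habana Gaudi example earlier in this section.

```yaml
# Hypothetical accelerator profile for an NVIDIA GPU (illustrative values only).
apiVersion: dashboard.opendatahub.io/v1alpha   # copied from the Gaudi example in this section
kind: AcceleratorProfile
metadata:
  name: nvidia-gpu-profile          # assumed name
spec:
  displayName: NVIDIA GPU           # required: name shown for the profile
  description: Example NVIDIA GPU accelerator profile   # optional
  enabled: true                     # required: makes the accelerator visible in OpenShift AI
  identifier: nvidia.com/gpu        # required: assumed resource name for NVIDIA GPUs
  tolerations:                      # optional: only needed if GPU nodes are tainted
    - key: nvidia.com/gpu           # assumed taint key
      operator: Exists
      effect: NoSchedule
```

Only displayName, identifier, and enabled are required; you can omit description and tolerations if you do not need them.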

5.2.1. Viewing accelerator profiles

If you have defined accelerator profiles for OpenShift AI, you can view, enable, and disable them from the Accelerator profiles page.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • You are part of the cluster-admins or dedicated-admins user group in your OpenShift cluster. The dedicated-admins user group applies only to OpenShift Dedicated.
  • Your deployment contains existing accelerator profiles.

Procedure

  1. From the OpenShift AI dashboard, click Settings → Accelerator profiles.

    The Accelerator profiles page appears, displaying existing accelerator profiles.

  2. Inspect the list of accelerator profiles. To enable or disable an accelerator profile, on the row containing the accelerator profile, click the toggle in the Enable column.

Verification

  • The Accelerator profiles page appears, displaying existing accelerator profiles.
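
The toggle in the Enable column most likely corresponds to the enabled attribute described in Table 5.1. This mapping is an assumption; this section does not state it explicitly. Under that assumption, a disabled profile would differ from the Habana Gaudi example only in the following field:

```yaml
# Fragment of an AcceleratorProfile (assumed mapping of the Enable toggle).
spec:
  enabled: false   # the accelerator is no longer visible in OpenShift AI
```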

5.2.2. Creating an accelerator profile

To configure accelerators for your data scientists to use in OpenShift AI, you must create an associated accelerator profile.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • You are part of the cluster-admins or dedicated-admins user group in your OpenShift cluster. The dedicated-admins user group applies only to OpenShift Dedicated.

Procedure

  1. From the OpenShift AI dashboard, click Settings → Accelerator profiles.

    The Accelerator profiles page appears, displaying existing accelerator profiles. To enable or disable an existing accelerator profile, on the row containing the relevant accelerator profile, click the toggle in the Enable column.

  2. Click Create accelerator profile.

    The Create accelerator profile dialog appears.

  3. In the Name field, enter a name for the accelerator profile.
  4. In the Identifier field, enter a unique string that identifies the hardware accelerator associated with the accelerator profile.
  5. Optional: In the Description field, enter a description for the accelerator profile.
  6. To enable or disable the accelerator profile immediately after creation, click the toggle in the Enable column.
  7. Optional: Add a toleration to schedule pods with matching taints.

    1. Click Add toleration.

      The Add toleration dialog opens.

    2. From the Operator list, select one of the following options:

      • Equal - The key/value/effect parameters must match. This is the default.
      • Exists - The key/effect parameters must match. You must leave the value parameter blank, which matches any value.
    3. From the Effect list, select one of the following options:

      • None
      • NoSchedule - New pods that do not match the taint are not scheduled onto that node. Existing pods on the node remain.
      • PreferNoSchedule - New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to. Existing pods on the node remain.
      • NoExecute - New pods that do not match the taint cannot be scheduled onto that node. Existing pods on the node that do not have a matching toleration are removed.
    4. In the Key field, enter a toleration key. The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.
    5. In the Value field, enter a toleration value. The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.
    6. In the Toleration Seconds section, select one of the following options to specify how long a pod stays bound to a node that has a node condition.

      • Forever - Pods stay permanently bound to a node.
      • Custom value - Enter a value, in seconds, to define how long pods stay bound to a node that has a node condition.
    7. Click Add.
  8. Click Create accelerator profile.
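
The fields in the Add toleration dialog correspond to the fields of a Kubernetes Toleration v1 core object. The following sketch shows how a toleration with a custom Toleration Seconds value might appear in the tolerations array of the resulting profile. The key, the value, and the 300-second duration are hypothetical. In Kubernetes, tolerationSeconds takes effect only together with the NoExecute effect.

```yaml
# Fragment of spec.tolerations in an AcceleratorProfile (illustrative values only).
tolerations:
  - key: example.com/accelerator    # hypothetical key
    operator: Equal                 # key, value, and effect must all match the taint
    value: "true"                   # hypothetical value; leave unset when operator is Exists
    effect: NoExecute
    tolerationSeconds: 300          # "Custom value"; leave the field unset for "Forever"
```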

Verification

  • The accelerator profile appears on the Accelerator profiles page.
  • The Accelerator list appears on the Start a notebook server page. After you select an accelerator, the Number of accelerators field appears, which you can use to choose the number of accelerators for your notebook server.
  • The accelerator profile appears on the Instances tab on the details page for the AcceleratorProfile custom resource definition (CRD).

5.2.3. Updating an accelerator profile

You can update the existing accelerator profiles in your deployment. You might want to change important identifying information, such as the display name, the identifier, or the description.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • You are part of the cluster-admins or dedicated-admins user group in your OpenShift cluster. The dedicated-admins user group applies only to OpenShift Dedicated.
  • The accelerator profile exists in your deployment.

Procedure

  1. From the OpenShift AI dashboard, click Settings → Accelerator profiles.

    The Accelerator profiles page appears, displaying existing accelerator profiles. To enable or disable an existing accelerator profile, on the row containing the relevant accelerator profile, click the toggle in the Enable column.

  2. Click the action menu (⋮) beside the accelerator profile that you want to update and select Edit from the list.

    The Edit accelerator profile dialog opens.

  3. In the Name field, update the accelerator profile name.
  4. In the Identifier field, update the unique string that identifies the hardware accelerator associated with the accelerator profile, if applicable.
  5. Optional: In the Description field, update the description of the accelerator profile.
  6. To enable or disable the accelerator profile, click the toggle in the Enable column.
  7. Optional: Add a toleration to schedule pods with matching taints.

    1. Click Add toleration.

      The Add toleration dialog opens.

    2. From the Operator list, select one of the following options:

      • Equal - The key/value/effect parameters must match. This is the default.
      • Exists - The key/effect parameters must match. You must leave the value parameter blank, which matches any value.
    3. From the Effect list, select one of the following options:

      • None
      • NoSchedule - New pods that do not match the taint are not scheduled onto that node. Existing pods on the node remain.
      • PreferNoSchedule - New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to. Existing pods on the node remain.
      • NoExecute - New pods that do not match the taint cannot be scheduled onto that node. Existing pods on the node that do not have a matching toleration are removed.
    4. In the Key field, enter a toleration key. The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.
    5. In the Value field, enter a toleration value. The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.
    6. In the Toleration Seconds section, select one of the following options to specify how long a pod stays bound to a node that has a node condition.

      • Forever - Pods stay permanently bound to a node.
      • Custom value - Enter a value, in seconds, to define how long pods stay bound to a node that has a node condition.
    7. Click Add.
  8. If your accelerator profile contains existing tolerations, you can edit them.

    1. Click the action menu (⋮) on the row containing the toleration that you want to edit and select Edit from the list.
    2. Complete the applicable fields to update the details of the toleration.
    3. Click Update.
  9. Click Update accelerator profile.

Verification

  • If your accelerator profile has new identifying information, this information appears in the Accelerator list on the Start a notebook server page.

5.2.4. Deleting an accelerator profile

To discard accelerator profiles that you no longer require, you can delete them so that they do not appear on the dashboard.

Prerequisites

  • You have logged in to Red Hat OpenShift AI.
  • You are part of the cluster-admins or dedicated-admins user group in your OpenShift cluster. The dedicated-admins user group applies only to OpenShift Dedicated.
  • The accelerator profile that you want to delete exists in your deployment.

Procedure

  1. From the OpenShift AI dashboard, click Settings → Accelerator profiles.

    The Accelerator profiles page appears, displaying existing accelerator profiles.

  2. Click the action menu (⋮) beside the accelerator profile that you want to delete and click Delete.

    The Delete accelerator profile dialog opens.

  3. Enter the name of the accelerator profile in the text field to confirm that you intend to delete it.
  4. Click Delete.

Verification

  • The accelerator profile no longer appears on the Accelerator profiles page.

5.3. Habana Gaudi integration

To accelerate your high-performance deep learning (DL) models, you can integrate Habana Gaudi devices in OpenShift AI. OpenShift AI also includes the HabanaAI workbench image, which is pre-built and ready for your data scientists to use after you install or upgrade OpenShift AI.

Before you can enable Habana Gaudi devices in OpenShift AI, you must install the necessary dependencies and the version of the HabanaAI Operator that matches the Habana version of the HabanaAI workbench image in your deployment. This allows your data scientists to use Habana libraries and software associated with Habana Gaudi devices from their workbench.

For more information about how to enable your OpenShift environment for Habana Gaudi devices, see HabanaAI Operator v1.10 for OpenShift and HabanaAI Operator v1.13 for OpenShift.

Important

Currently, Habana Gaudi integration is supported only in OpenShift 4.12.

You can use Habana Gaudi accelerators on OpenShift AI with versions 1.10.0 and 1.13.0 of the HabanaAI Operator. The version of the HabanaAI Operator that you install must match the Habana version of the HabanaAI workbench image in your deployment. This means that only one version of the HabanaAI workbench image works in your deployment at a time.

For information about the supported configurations for versions 1.10 and 1.13 of the HabanaAI Operator, see Support Matrix v1.10.0 and Support Matrix v1.13.0.

You can use Habana Gaudi devices in an Amazon EC2 DL1 instance on OpenShift. Therefore, your OpenShift platform must support EC2 DL1 instances. Habana Gaudi accelerators are available to your data scientists when they create a workbench instance or serve a model.

To identify the Habana Gaudi devices present in your deployment, use the lspci utility. For more information, see lspci(8) - Linux man page.

Important

If the lspci utility indicates that Habana Gaudi devices are present in your deployment, it does not necessarily mean that the devices are ready to use.

Before you can use your Habana Gaudi devices, you must enable them in your OpenShift environment and configure an accelerator profile for each device. For more information about how to enable your OpenShift environment for Habana Gaudi devices, see HabanaAI Operator for OpenShift.

5.3.1. Enabling Habana Gaudi devices

Before you can use Habana Gaudi devices in OpenShift AI, you must install the necessary dependencies and deploy the HabanaAI Operator.

Prerequisites

  • You have logged in to OpenShift.
  • You have the cluster-admin role in OpenShift.

Procedure

  1. To enable Habana Gaudi devices in OpenShift AI, follow the instructions at HabanaAI Operator for OpenShift.
  2. From the OpenShift AI dashboard, click Settings → Accelerator profiles.

    The Accelerator profiles page appears, displaying existing accelerator profiles. To enable or disable an existing accelerator profile, on the row containing the relevant accelerator profile, click the toggle in the Enable column.

  3. Click Create accelerator profile.

    The Create accelerator profile dialog opens.

  4. In the Name field, enter a name for the Habana Gaudi device.
  5. In the Identifier field, enter a unique string that identifies the Habana Gaudi device, for example, habana.ai/gaudi.
  6. Optional: In the Description field, enter a description for the Habana Gaudi device.
  7. To enable or disable the accelerator profile for the Habana Gaudi device immediately after creation, click the toggle in the Enable column.
  8. Optional: Add a toleration to schedule pods with matching taints.

    1. Click Add toleration.

      The Add toleration dialog opens.

    2. From the Operator list, select one of the following options:

      • Equal - The key/value/effect parameters must match. This is the default.
      • Exists - The key/effect parameters must match. You must leave the value parameter blank, which matches any value.
    3. From the Effect list, select one of the following options:

      • None
      • NoSchedule - New pods that do not match the taint are not scheduled onto that node. Existing pods on the node remain.
      • PreferNoSchedule - New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to. Existing pods on the node remain.
      • NoExecute - New pods that do not match the taint cannot be scheduled onto that node. Existing pods on the node that do not have a matching toleration are removed.
    4. In the Key field, enter the toleration key habana.ai/gaudi. The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.
    5. In the Value field, enter a toleration value. The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.
    6. In the Toleration Seconds section, select one of the following options to specify how long a pod stays bound to a node that has a node condition.

      • Forever - Pods stay permanently bound to a node.
      • Custom value - Enter a value, in seconds, to define how long pods stay bound to a node that has a node condition.
    7. Click Add.
  9. Click Create accelerator profile.

Verification

  • From the Administrator perspective, the following Operators appear on the Operators → Installed Operators page.

    • HabanaAI
    • Node Feature Discovery (NFD)
    • Kernel Module Management (KMM)
  • The Accelerator list displays the Habana Gaudi accelerator on the Start a notebook server page. After you select an accelerator, the Number of accelerators field appears, which you can use to choose the number of accelerators for your notebook server.
  • The accelerator profile appears on the Accelerator profiles page.
  • The accelerator profile appears on the Instances tab on the details page for the AcceleratorProfile custom resource definition (CRD).
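
To illustrate what the identifier and toleration in the profile accomplish, the following sketch shows a pod specification that requests one Gaudi device directly. When a data scientist selects the accelerator, OpenShift AI is expected to apply an equivalent resource request and toleration to the notebook pod, although the exact pod that it generates is not shown in this section. The pod name, container name, and image are hypothetical placeholders.

```yaml
# Hypothetical pod that requests one Habana Gaudi device (illustrative only).
apiVersion: v1
kind: Pod
metadata:
  name: gaudi-example               # assumed name
spec:
  containers:
    - name: workload                # assumed container name
      image: registry.example.com/habana-workload:latest   # placeholder image
      resources:
        limits:
          habana.ai/gaudi: 1        # same string as the profile identifier
  tolerations:
    - key: habana.ai/gaudi          # same key as the toleration in the profile
      operator: Exists
      effect: NoSchedule
```

If a pod like this stays in the Pending state, a likely cause is that no node advertises the habana.ai/gaudi resource, which suggests that the HabanaAI Operator is not yet deployed or the devices are not ready.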