Chapter 2. Configuring distributed workloads

To configure the distributed workloads feature for your data scientists to use in OpenShift AI, you must create the required Kueue resources, enable several components in the Red Hat OpenShift AI Add-on, and optionally configure the CodeFlare Operator.

2.1. Configuring the distributed workloads components

To configure the distributed workloads feature for your data scientists to use in OpenShift AI, you must enable several components.

Prerequisites

  • You have logged in to OpenShift with the cluster-admin role.
  • You have access to the data science cluster.
  • You have installed Red Hat OpenShift AI.
  • You have sufficient resources. In addition to the minimum OpenShift AI resources described in Installing and deploying OpenShift AI, you need 1.6 vCPU and 2 GiB memory to deploy the distributed workloads infrastructure.
  • You have access to a Ray cluster image. For information about how to create a Ray cluster, see the Ray Clusters documentation.

    Note

    Mutual Transport Layer Security (mTLS) is enabled by default in the CodeFlare component in OpenShift AI. In the current OpenShift AI version, submissionMode=K8sJobMode is not supported in the Ray job specification, so the KubeRay Operator cannot create a submitter Kubernetes Job to submit the Ray job. Instead, users must configure the Ray job specification to set submissionMode=HTTPMode only, so that the KubeRay Operator sends a request to the RayCluster to create a Ray job.

  • You have access to the data sets and models that the distributed workload uses.
  • You have access to the Python dependencies for the distributed workload.
  • You have removed any previously installed instances of the CodeFlare Operator, as described in the Knowledgebase solution How to migrate from a separately installed CodeFlare Operator in your data science cluster.
  • If you want to use graphics processing units (GPUs), you have enabled GPU support in OpenShift AI. See Enabling GPU support in OpenShift AI.
  • If you want to use self-signed certificates, you have added them to a central Certificate Authority (CA) bundle as described in Working with certificates. No additional configuration is necessary to use those certificates with distributed workloads. The centrally configured self-signed certificates are automatically available in the workload pods at the following mount points:

    • Cluster-wide CA bundle:

      /etc/pki/tls/certs/odh-trusted-ca-bundle.crt
      /etc/ssl/certs/odh-trusted-ca-bundle.crt
    • Custom CA bundle:

      /etc/pki/tls/certs/odh-ca-bundle.crt
      /etc/ssl/certs/odh-ca-bundle.crt
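
As a quick check from inside a workload pod, the following minimal sketch lists which of the mount points above are present. The helper name existing_ca_bundles is hypothetical and is not part of OpenShift AI or the CodeFlare SDK; only the four paths come from the list above.

```python
from pathlib import Path

# Mount points listed above for the centrally configured CA bundles.
CA_BUNDLE_PATHS = [
    "/etc/pki/tls/certs/odh-trusted-ca-bundle.crt",
    "/etc/ssl/certs/odh-trusted-ca-bundle.crt",
    "/etc/pki/tls/certs/odh-ca-bundle.crt",
    "/etc/ssl/certs/odh-ca-bundle.crt",
]


def existing_ca_bundles(paths=CA_BUNDLE_PATHS):
    """Return the subset of paths that exist as regular files."""
    return [p for p in paths if Path(p).is_file()]


if __name__ == "__main__":
    # Print each bundle that is actually mounted in this environment.
    for path in existing_ca_bundles():
        print(path)
```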

Procedure

  1. In the OpenShift console, click Operators → Installed Operators.
  2. Search for the Red Hat OpenShift AI Operator, and then click the Operator name to open the Operator details page.
  3. Click the Data Science Cluster tab.
  4. Click the default instance name (for example, default-dsc) to open the instance details page.
  5. Click the YAML tab to show the instance specifications.
  6. Enable the required distributed workloads components. In the spec:components section, set the managementState field correctly for the required components. The list of required components depends on whether the distributed workload is run from a pipeline or notebook or both, as shown in the following table.

    Table 2.1. Components required for distributed workloads

    Component             Pipelines only  Notebooks only  Pipelines and notebooks
    codeflare             Managed         Managed         Managed
    dashboard             Managed         Managed         Managed
    datasciencepipelines  Managed         Removed         Managed
    kueue                 Managed         Managed         Managed
    ray                   Managed         Managed         Managed
    workbenches           Removed         Managed         Managed

  7. Click Save. After a short time, the components with a Managed state are ready.
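
Table 2.1 can also be expressed as a small lookup, which may help when scripting or reviewing the spec:components section. This is an illustrative sketch only: the function name and the mode labels (pipelines, notebooks, both) are hypothetical, and only the component names and states are taken from the table.

```python
import json

# Component states from Table 2.1, in the order:
# (pipelines only, notebooks only, pipelines and notebooks).
_TABLE = {
    "codeflare": ("Managed", "Managed", "Managed"),
    "dashboard": ("Managed", "Managed", "Managed"),
    "datasciencepipelines": ("Managed", "Removed", "Managed"),
    "kueue": ("Managed", "Managed", "Managed"),
    "ray": ("Managed", "Managed", "Managed"),
    "workbenches": ("Removed", "Managed", "Managed"),
}
# Hypothetical mode labels mapped to table columns.
_MODES = {"pipelines": 0, "notebooks": 1, "both": 2}


def required_components(mode):
    """Build the spec.components mapping for the given usage mode."""
    column = _MODES[mode]  # raises KeyError for an unknown mode
    return {
        name: {"managementState": states[column]}
        for name, states in _TABLE.items()
    }


if __name__ == "__main__":
    spec = {"spec": {"components": required_components("notebooks")}}
    print(json.dumps(spec, indent=2))
```

The printed structure mirrors the layout of the YAML tab, so it can be compared against the instance specification before you click Save.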

Verification

Check the status of the codeflare-operator-manager, kuberay-operator, and kueue-controller-manager pods, as follows:

  1. In the OpenShift console, from the Project list, select redhat-ods-applications.
  2. Click Workloads → Deployments.
  3. Search for the codeflare-operator-manager, kuberay-operator, and kueue-controller-manager deployments. In each case, check the status as follows:

    1. Click the deployment name to open the deployment details page.
    2. Click the Pods tab.
    3. Check the pod status.

      When the status of the codeflare-operator-manager-<pod-id>, kuberay-operator-<pod-id>, and kueue-controller-manager-<pod-id> pods is Running, the pods are ready to use.

    4. To see more information about each pod, click the pod name to open the pod details page, and then click the Logs tab.
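
The same check can be scripted against the JSON output of oc get pods -n redhat-ods-applications -o json. The sketch below only parses that output. The function names are hypothetical, and the code assumes the standard Kubernetes pod list layout (items[].metadata.name and items[].status.phase).

```python
import json

# Deployment name prefixes named in the verification steps above.
OPERATOR_PREFIXES = (
    "codeflare-operator-manager",
    "kuberay-operator",
    "kueue-controller-manager",
)


def operator_pod_phases(pod_list_json, prefixes=OPERATOR_PREFIXES):
    """Map each prefix to the phases of the pods whose names start with it."""
    items = json.loads(pod_list_json).get("items", [])
    phases = {prefix: [] for prefix in prefixes}
    for pod in items:
        name = pod["metadata"]["name"]
        for prefix in prefixes:
            # Pod names have the form <deployment-name>-<pod-id>.
            if name.startswith(prefix + "-"):
                phases[prefix].append(pod["status"].get("phase", "Unknown"))
    return phases


def all_running(phases):
    """True only if every prefix has at least one pod and all are Running."""
    return all(p and all(x == "Running" for x in p) for p in phases.values())
```

A script could read the oc output from standard input, pass it to operator_pod_phases, and report any deployment whose pods are not yet Running.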

2.2. Configuring quota management for distributed workloads

Configure quotas for distributed workloads on a cluster, so that you can share resources between several data science projects.

Prerequisites

  • You have cluster administrator privileges for your OpenShift cluster.
  • You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI (Red Hat OpenShift Dedicated) or Installing the OpenShift CLI (Red Hat OpenShift Service on AWS).
  • You have enabled the required distributed workloads components as described in Configuring the distributed workloads components.
  • You have sufficient resources. In addition to the base OpenShift AI resources, you need 1.6 vCPU and 2 GiB memory to deploy the distributed workloads infrastructure.
  • The resources are physically available in the cluster.

    Note

    OpenShift AI currently supports only a single cluster queue per cluster (that is, homogeneous clusters), and only empty resource flavors. For more information about Kueue resources, see the Kueue documentation.

Procedure

  1. In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. Create an empty Kueue resource flavor, as follows:

    1. Create a file called default_flavor.yaml and populate it with the following content:

      Empty Kueue resource flavor

      apiVersion: kueue.x-k8s.io/v1beta1
      kind: ResourceFlavor
      metadata:
        name: default-flavor

    2. Apply the configuration to create the default-flavor object:

      $ oc apply -f default_flavor.yaml
  3. Create a cluster queue to manage the empty Kueue resource flavor, as follows:

    1. Create a file called cluster_queue.yaml and populate it with the following content:

      Example cluster queue

      apiVersion: kueue.x-k8s.io/v1beta1
      kind: ClusterQueue
      metadata:
        name: "cluster-queue"
      spec:
        namespaceSelector: {}  # match all.
        resourceGroups:
        - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
          flavors:
          - name: "default-flavor"
            resources:
            - name: "cpu"
              nominalQuota: 9
            - name: "memory"
              nominalQuota: 36Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 5

    2. Replace the example quota values (9 CPUs, 36 GiB memory, and 5 NVIDIA GPUs) with the appropriate values for your cluster queue. The cluster queue will start a distributed workload only if the total required resources are within these quota limits.

      Note

      In this release of OpenShift AI, the only accelerators supported for distributed workloads are NVIDIA GPUs.

    3. Apply the configuration to create the cluster-queue object:

      $ oc apply -f cluster_queue.yaml
  4. Create a local queue that points to your cluster queue, as follows:

    1. Create a file called local_queue.yaml and populate it with the following content:

      Example local queue

      apiVersion: kueue.x-k8s.io/v1beta1
      kind: LocalQueue
      metadata:
        namespace: test
        name: local-queue-test
        annotations:
          kueue.x-k8s.io/default-queue: 'true'
      spec:
        clusterQueue: cluster-queue

      The kueue.x-k8s.io/default-queue: 'true' annotation defines this queue as the default queue. Distributed workloads are submitted to this queue if no local_queue value is specified in the ClusterConfiguration section of the data science pipeline or Jupyter notebook or Microsoft Visual Studio Code file.

    2. Update the namespace value to specify the same namespace as in the ClusterConfiguration section that creates the Ray cluster.
    3. Optional: Change the name value to the name that you want to use for the local queue.
    4. Apply the configuration to create the local-queue object:

      $ oc apply -f local_queue.yaml

      The cluster queue allocates the resources to run distributed workloads in the local queue.
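
To illustrate the quota rule described for the cluster queue in step 3, the following simplified sketch checks whether the total request of a workload fits within the remaining nominal quota. It is not how Kueue is implemented, and it ignores Kueue features such as borrowing and preemption. The function name, the treatment of uncovered resources, and the convention of expressing memory as a number of GiB are assumptions made for the example.

```python
# Example nominal quotas from the cluster queue above (memory in GiB).
NOMINAL_QUOTA = {"cpu": 9, "memory": 36, "nvidia.com/gpu": 5}


def fits_quota(requested, in_use=None, quota=NOMINAL_QUOTA):
    """Return True if in_use + requested stays within quota for every resource."""
    in_use = in_use or {}
    for resource, amount in requested.items():
        if resource not in quota:
            # Not covered by the cluster queue: treated as not admissible
            # in this simplified model.
            return False
        if in_use.get(resource, 0) + amount > quota[resource]:
            return False
    return True


if __name__ == "__main__":
    workload = {"cpu": 4, "memory": 16, "nvidia.com/gpu": 2}
    print(fits_quota(workload))
    print(fits_quota(workload, in_use={"cpu": 6}))
```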

Verification

Check the status of the local queue in a project, as follows:

$ oc get -n <project-name> localqueues

2.3. Configuring the CodeFlare Operator

If you want to change the default configuration of the CodeFlare Operator for distributed workloads in OpenShift AI, you can edit the associated config map.

Prerequisites

Procedure

  1. In the OpenShift console, click Workloads → ConfigMaps.
  2. From the Project list, select redhat-ods-applications.
  3. Search for the codeflare-operator-config config map, and click the config map name to open the ConfigMap details page.
  4. Click the YAML tab to show the config map specifications.
  5. In the data:config.yaml:kuberay section, you can edit the following entries:

    ingressDomain

    This configuration option is empty (ingressDomain: "") by default. Do not change this option unless the Ingress Controller is not running on OpenShift. OpenShift AI uses this value to generate the dashboard and client routes for every Ray Cluster, as shown in the following examples:

    Example dashboard and client routes

    ray-dashboard-<clustername>-<namespace>.<your.ingress.domain>
    ray-client-<clustername>-<namespace>.<your.ingress.domain>

    mTLSEnabled

    This configuration option is enabled (mTLSEnabled: true) by default. When this option is enabled, the Ray Cluster pods create certificates that are used for mutual Transport Layer Security (mTLS), a form of mutual authentication, between Ray Cluster nodes. When this option is enabled, Ray clients cannot connect to the Ray head node unless they download the generated certificates from the ca-secret-<cluster_name> secret, generate the necessary certificates for mTLS communication, and then set the required Ray environment variables. Users must then re-initialize the Ray clients to apply the changes. The CodeFlare SDK provides the following functions to simplify the authentication process for Ray clients:

    Example Ray client authentication code

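    import ray
    from codeflare_sdk import generate_cert

    # `cluster` is assumed to be the CodeFlare SDK cluster object created earlier.
    # Generate the client certificates, then export the Ray environment variables.
    generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)
    generate_cert.export_env(cluster.config.name, cluster.config.namespace)

    # Re-initialize the Ray client so that it uses the exported settings.
    ray.init(cluster.cluster_uri())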

    rayDashboardOAuthEnabled

    This configuration option is enabled (rayDashboardOAuthEnabled: true) by default. When this option is enabled, OpenShift AI places an OpenShift OAuth proxy in front of the Ray Cluster head node. Users must then authenticate by using their OpenShift cluster login credentials when accessing the Ray Dashboard through the browser. If users want to access the Ray Dashboard in another way (for example, by using the Ray JobSubmissionClient class), they must set an authorization header as part of their request, as shown in the following example:

    Example authorization header

    {"Authorization": "Bearer <your-openshift-token>"}

  6. To save your changes, click Save.
  7. To apply your changes, delete the pod:

    1. Click Workloads → Pods.
    2. Find the codeflare-operator-manager-<pod-id> pod.
    3. Click the options menu (⋮) for that pod, and then click Delete Pod. The pod restarts with your changes applied.
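
For the rayDashboardOAuthEnabled entry in step 5, the following minimal sketch builds the dashboard host name and the authorization header, and passes the header to the Ray JobSubmissionClient class. The helper names and the https:// scheme are assumptions; the route pattern and the header format come from the examples in step 5. The ray import is deferred so that the two small helpers can be used without Ray installed.

```python
def dashboard_host(cluster_name, namespace, ingress_domain):
    """Dashboard route host, following the route pattern shown in step 5."""
    return f"ray-dashboard-{cluster_name}-{namespace}.{ingress_domain}"


def auth_headers(token):
    """Authorization header that carries the OpenShift token."""
    return {"Authorization": f"Bearer {token}"}


def make_job_client(cluster_name, namespace, ingress_domain, token):
    """Create a JobSubmissionClient that sends the bearer token."""
    # Imported here so the helpers above work without Ray installed.
    from ray.job_submission import JobSubmissionClient

    # Assumption: the dashboard route is served over HTTPS.
    host = dashboard_host(cluster_name, namespace, ingress_domain)
    return JobSubmissionClient(f"https://{host}", headers=auth_headers(token))
```

Keeping the token out of source files, for example by reading it from an environment variable at run time, avoids committing credentials alongside notebook code.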

Verification

Check the status of the codeflare-operator-manager pod, as follows:

  1. In the OpenShift console, click Workloads → Deployments.
  2. Search for the codeflare-operator-manager deployment, and then click the deployment name to open the deployment details page.
  3. Click the Pods tab. When the status of the codeflare-operator-manager-<pod-id> pod is Running, the pod is ready to use. To see more information about the pod, click the pod name to open the pod details page, and then click the Logs tab.