How to deploy a machine learning model by using KServe RawDeployment mode with single node OpenShift

Solution In Progress

Environment

Red Hat OpenShift AI 2.11

Issue

As there is no customer portal documentation currently available on deploying a machine learning model using KServe RawDeployment mode on single node OpenShift, follow the instructions in this article.

Note: The instructions in this article are only valid for new installations of Red Hat OpenShift AI (RHOAI) 2.11 self-managed. However, these steps are not valid for disconnected self-managed deployments.

Important: Deploying a machine learning model using KServe RawDeployment mode on single node OpenShift is a Limited Availability feature. Limited Availability means that you can install and receive support for the feature only with specific approval from the Red Hat AI Business Unit. Without such approval, the feature is unsupported. This applies to all content described in this article.

KServe supports two deployment modes: Serverless and RawDeployment. Each mode has advantages and disadvantages:

  • Serverless mode

    • Advantages:

      • Enables autoscaling based on request volume: This means resources scale up automatically based on incoming requests, which can optimize resource usage and maintain performance during peak times.

      • Supports scale down to and from zero using Knative: This capability allows resources to scale down completely when there are no incoming requests, which can save costs by not running idle resources.

    • Disadvantages:

      • Has customization limitations: Serverless mode is constrained by Knative, which restricts some configurations, such as mounting multiple volumes.

      • Dependency on Knative: Using Knative for scaling can introduce additional complexity in setup and management compared to traditional scaling methods.

  • RawDeployment mode

    • Advantages:

      • Enables deployment with Kubernetes resources like Deployment, Service, Ingress, and Horizontal Pod Autoscaler: Provides full control over Kubernetes resources, allowing for detailed customization and configuration of deployment settings.

      • Unlocks Knative limitations such as mounting multiple volumes: Overcomes specific restrictions of Knative, which can be beneficial for applications requiring complex configurations or multiple storage mounts.

    • Disadvantages:

      • Does not support "Scale down to and from Zero": Unlike Serverless mode with Knative, RawDeployment mode does not support automatic scaling down to zero resources when idle, which may result in higher costs during periods of low traffic.

      • Requires manual management of scaling (see the example after this list).
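
For example, with RawDeployment mode you scale the model server yourself, either by changing minReplicas on the InferenceService or by scaling the underlying predictor Deployment directly. A minimal sketch, assuming the Deployment follows the <isvc_name>-predictor naming pattern shown in the example output later in this article:

$ oc scale deployment <isvc_name>-predictor --replicas=2 -n <namespace>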

As a prerequisite to performing the steps, ensure that you have met the following criteria:

  • You have installed the OpenShift command-line interface (CLI). For more information, see Getting started with the OpenShift CLI.
  • You have cluster administrator privileges for your OpenShift Container Platform cluster.
  • You have created an OpenShift cluster that has a node with 4 CPUs and 16 GB memory.
  • You have installed the Red Hat OpenShift AI (RHOAI) Operator.
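
Optionally, you can confirm that the RHOAI Operator is installed by listing its ClusterServiceVersion. This is a quick check, assuming the Operator was installed into the default redhat-ods-operator namespace:

$ oc get csv -n redhat-ods-operator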

Resolution

  1. Open a command-line terminal and log in to your OpenShift cluster as cluster administrator:
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. By default, OpenShift uses a service mesh for network traffic management. As KServe RawDeployment mode does not require a service mesh, disable Red Hat OpenShift Service Mesh:
$ oc edit dsci -n redhat-ods-operator
  3. In the YAML editor, change the value of managementState for the serviceMesh component to Removed.
  4. Save the changes.
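
For reference, the relevant part of the DSCInitialization spec looks similar to the following after the change (a sketch; leave the other fields in your DSCInitialization unchanged):

spec:
  serviceMesh:
    managementState: Removed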
  5. Create a project. For information about creating projects, see Working with projects.
$ oc new-project <project_name> --description="<description>" --display-name="<display_name>"
  6. In the Red Hat OpenShift web console Administrator view, click Operators → Installed Operators and then click the Red Hat OpenShift AI Operator.
  7. Click the Data Science Cluster tab.
  8. Click the Create DataScienceCluster button.
  9. In the Configure via field, click the YAML view radio button.
  10. In the spec.components section of the YAML editor, configure the kserve component as shown:
  kserve:
    defaultDeploymentMode: RawDeployment
    managementState: Managed
    serving:
      managementState: Removed
      name: knative-serving
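
For context, the kserve block sits under spec.components in the DataScienceCluster custom resource. A minimal sketch of the surrounding structure (the metadata name shown here is the common default and may differ in your cluster; other components are omitted and keep whatever values your installation uses):

apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    kserve:
      defaultDeploymentMode: RawDeployment
      managementState: Managed
      serving:
        managementState: Removed
        name: knative-serving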
  11. Click Create.
  12. Create a secret file. At your command-line terminal, create a YAML file to contain your secret and add the following YAML code:
kind: Secret
apiVersion: v1
metadata: 
  name: <secret-name>
data: 
  AWS_ACCESS_KEY_ID: <base64-encoded-access-key-id>
  AWS_DEFAULT_REGION: <base64-encoded-region>
  AWS_S3_BUCKET: <base64-encoded-bucket-name>
  AWS_S3_ENDPOINT: <base64-encoded-endpoint>
  AWS_SECRET_ACCESS_KEY: <base64-encoded-secret-access-key>
type: Opaque
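
The values under data must be base64 encoded. A minimal sketch of encoding a single value in a bash shell (my-access-key is a placeholder):

$ echo -n 'my-access-key' | base64

Alternatively, you can let OpenShift encode the values for you by creating the secret from literal values instead of applying a YAML file, for example:

$ oc create secret generic <secret-name> \
    --from-literal=AWS_ACCESS_KEY_ID=<access-key-id> \
    --from-literal=AWS_SECRET_ACCESS_KEY=<secret-access-key> \
    --from-literal=AWS_DEFAULT_REGION=<region> \
    --from-literal=AWS_S3_BUCKET=<bucket-name> \
    --from-literal=AWS_S3_ENDPOINT=<endpoint> \
    -n <namespace>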
  13. Save the file with the file name secret.yaml.
  14. Apply the secret.yaml file:
$ oc apply -f secret.yaml -n <namespace>
  15. Create a service account. Create a YAML file to contain your service account and add the following YAML code. Ensure that the secret name listed under secrets matches the metadata.name of the secret that you created in the previous steps:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: models-bucket-sa
secrets:
- name: s3creds

For information about service accounts, see Understanding and creating service accounts.
  16. Save the file with the file name serviceAccount.yaml.
  17. Apply the serviceAccount.yaml file:

$ oc apply -f serviceAccount.yaml -n <namespace>
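
Optionally, confirm that the service account exists and lists the secret (a quick check; the exact output format can vary by OpenShift version):

$ oc get serviceaccount models-bucket-sa -n <namespace> -o yaml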
  18. Create a YAML file for the serving runtime to define the container image that will serve your model predictions. Here is an example using the OpenVINO Model Server:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: ovms-runtime
spec:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8888"
  containers:
    - args:
        - --model_name={{.Name}}
        - --port=8001
        - --rest_port=8888
        - --model_path=/mnt/models
        - --file_system_poll_wait_seconds=0
        - --grpc_bind_address=0.0.0.0
        - --rest_bind_address=0.0.0.0
        - --target_device=AUTO
        - --metrics_enable
      image: quay.io/modh/openvino_model_server@sha256:6c7795279f9075bebfcd9aecbb4a4ce4177eec41fb3f3e1f1079ce6309b7ae45
      name: kserve-container
      ports:
        - containerPort: 8888
          protocol: TCP
  multiModel: false
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: openvino_ir
      version: opset13
    - name: onnx
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "1"
    - autoSelect: true
      name: tensorflow
      version: "2"
    - autoSelect: true
      name: paddle
      version: "2"
    - autoSelect: true
      name: pytorch
      version: "2"
  19. If you are using the OpenVINO Model Server example above, ensure that you insert the correct values required for any placeholders in the YAML code.
  20. Save the file with an appropriate file name.
  21. Apply the file containing your serving runtime:
$ oc apply -f <serving runtime file name> -n <namespace>
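
To confirm that the serving runtime was created, you can list the ServingRuntime resources in your project (a quick check, assuming the KServe custom resource definitions are installed):

$ oc get servingruntimes -n <namespace>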
  22. Create an InferenceService custom resource (CR). Create a YAML file to contain the InferenceService CR. Continuing the OpenVINO Model Server example, here is the corresponding YAML code:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.knative.openshift.io/enablePassthrough: "true"
    sidecar.istio.io/inject: "true"
    sidecar.istio.io/rewriteAppHTTPProbers: "true"
    serving.kserve.io/deploymentMode: RawDeployment
  name: <InferenceService-Name>
spec:
  predictor:
    scaleMetric:
    minReplicas: 1
    scaleTarget:
    canaryTrafficPercent:
    serviceAccountName: <serviceAccountName>
    model:
      env: []
      volumeMounts: []
      modelFormat:
        name: onnx
      runtime: ovms-runtime
      storageUri: s3://<bucket_name>/<model_directory_path>
      resources:
        requests:
          memory: 5Gi
    volumes: []
  23. In your YAML code, ensure the following values are set correctly:
  • serving.kserve.io/deploymentMode must contain the value RawDeployment.
  • modelFormat must contain the value for your model format, such as onnx.
  • storageUri must contain the URI of your model's S3 storage directory, for example s3://<bucket_name>/<model_directory_path>.
  • runtime must contain the name of your serving runtime, for example, ovms-runtime.
  24. Save the file with an appropriate file name.
  25. Apply the file containing your InferenceService CR:
$ oc apply -f <InferenceService CR file name> -n <namespace>
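
You can also check the status of the InferenceService itself (a quick check; the READY column should eventually report True):

$ oc get inferenceservice <InferenceService-Name> -n <namespace>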
  26. Verify that all pods are running in your cluster:
$ oc get pods -n <namespace>

Example output:

NAME                                READY   STATUS    RESTARTS   AGE
<isvc_name>-predictor-xxxxx-2mr5l   1/1     Running   2          165m
console-698d866b78-m87pm            1/1     Running   2          165m
  27. After you verify that all pods are running, forward the service port to your local machine:
$ oc -n <namespace> port-forward pod/<pod-name> <local_port>:<remote_port>

Ensure that you replace <namespace>, <pod-name>, <local_port>, and <remote_port> (the model server port, for example, 8888) with values appropriate to your deployment.
  28. Use your preferred client library or tool to send requests to the localhost inference URL.
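
For example, the OpenVINO Model Server runtime in this article exposes the KServe v2 REST protocol on its rest_port (8888). A minimal sketch with curl, assuming you forwarded local port 8888 and that the model name resolves to your InferenceService name through the --model_name={{.Name}} argument in the serving runtime; the input name, shape, and datatype depend entirely on your model:

$ curl http://localhost:8888/v2/models/<model_name>/ready

$ curl -X POST http://localhost:8888/v2/models/<model_name>/infer \
    -H "Content-Type: application/json" \
    -d '{"inputs": [{"name": "<input_name>", "shape": [1, 3, 224, 224], "datatype": "FP32", "data": [<input_values>]}]}'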

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
