13.2. Installing the NVIDIA GPU administration dashboard

Add GPU capabilities to the OpenShift Container Platform (OCP) console by using Helm to install the OpenShift Console NVIDIA GPU plugin.

The OpenShift Console NVIDIA GPU plugin works as a remote bundle for the OCP console. To run the plugin, an instance of the OCP console must already be running.

Prerequisites

  • Red Hat OpenShift 4.11+
  • NVIDIA GPU Operator
  • Helm

Procedure

Use the following procedure to install the OpenShift Console NVIDIA GPU plugin.

  1. Add the Helm repository:

    $ helm repo add rh-ecosystem-edge https://rh-ecosystem-edge.github.io/console-plugin-nvidia-gpu
    $ helm repo update

  2. Install the Helm chart in the default NVIDIA GPU Operator namespace:

    $ helm install -n nvidia-gpu-operator console-plugin-nvidia-gpu rh-ecosystem-edge/console-plugin-nvidia-gpu

    Example output

    NAME: console-plugin-nvidia-gpu
    LAST DEPLOYED: Tue Aug 23 15:37:35 2022
    NAMESPACE: nvidia-gpu-operator
    STATUS: deployed
    REVISION: 1
    NOTES:
    View the Console Plugin NVIDIA GPU deployed resources by running the following command:
    
    $ oc -n {{ .Release.Namespace }} get all -l app.kubernetes.io/name=console-plugin-nvidia-gpu
    
    Enable the plugin by running the following command:
    
    # Check if a plugins field is specified
    $ oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}"
    
    # if not, then run the following command to enable the plugin
    $ oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-plugin-nvidia-gpu"] } }' --type=merge
    
    # if yes, then run the following command to enable the plugin
    $ oc patch consoles.operator.openshift.io cluster --patch '[{"op": "add", "path": "/spec/plugins/-", "value": "console-plugin-nvidia-gpu" }]' --type=json
    
    # add the required DCGM Exporter metrics ConfigMap to the existing NVIDIA operator ClusterPolicy CR:
    oc patch clusterpolicies.nvidia.com gpu-cluster-policy --patch '{ "spec": { "dcgmExporter": { "config": { "name": "console-plugin-nvidia-gpu" } } } }' --type=merge

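    The dashboard relies mostly on Prometheus metrics exposed by the NVIDIA DCGM Exporter, but the metrics exposed by default are not enough for the dashboard to render the required gauges. Therefore, the DCGM Exporter is configured to expose a custom set of metrics, as shown here.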

    apiVersion: v1
    data:
      dcgm-metrics.csv: |
        DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, gpu utilization.
        DCGM_FI_DEV_MEM_COPY_UTIL, gauge, mem utilization.
        DCGM_FI_DEV_ENC_UTIL, gauge, enc utilization.
        DCGM_FI_DEV_DEC_UTIL, gauge, dec utilization.
        DCGM_FI_DEV_POWER_USAGE, gauge, power usage.
        DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX, gauge, power mgmt limit.
        DCGM_FI_DEV_GPU_TEMP, gauge, gpu temp.
        DCGM_FI_DEV_SM_CLOCK, gauge, sm clock.
        DCGM_FI_DEV_MAX_SM_CLOCK, gauge, max sm clock.
        DCGM_FI_DEV_MEM_CLOCK, gauge, mem clock.
        DCGM_FI_DEV_MAX_MEM_CLOCK, gauge, max mem clock.
    kind: ConfigMap
    metadata:
      annotations:
        meta.helm.sh/release-name: console-plugin-nvidia-gpu
        meta.helm.sh/release-namespace: nvidia-gpu-operator
      creationTimestamp: "2022-10-26T19:46:41Z"
      labels:
        app.kubernetes.io/component: console-plugin-nvidia-gpu
        app.kubernetes.io/instance: console-plugin-nvidia-gpu
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: console-plugin-nvidia-gpu
        app.kubernetes.io/part-of: console-plugin-nvidia-gpu
        app.kubernetes.io/version: latest
        helm.sh/chart: console-plugin-nvidia-gpu-0.2.3
      name: console-plugin-nvidia-gpu
      namespace: nvidia-gpu-operator
      resourceVersion: "19096623"
      uid: 96cdf700-dd27-437b-897d-5cbb1c255068

    Install the ConfigMap and edit the NVIDIA Operator ClusterPolicy CR to add that ConfigMap to the DCGM Exporter configuration. Newer versions of the Console Plugin NVIDIA GPU Helm chart install the ConfigMap, but the user must edit the ClusterPolicy CR.
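
The Helm notes in step 2 give two different `oc patch` commands for enabling the plugin, depending on whether `.spec.plugins` is already set on the console operator. The following is a minimal sketch of that decision, not part of the chart itself. The function names `plugin_patch_mode` and `enable_plugin` are hypothetical, and the sketch additionally assumes that an empty list (`[]`) can be treated the same as an unset field and that an already listed plugin should not be added again.

```shell
#!/bin/sh
# Sketch only: choose between the two "oc patch" variants from the Helm notes.
PLUGIN="console-plugin-nvidia-gpu"

# plugin_patch_mode CURRENT
#   CURRENT is the text printed by:
#   oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}"
#   Prints "enabled", "merge", or "json".
plugin_patch_mode() {
    current="$1"
    case "$current" in
        *"\"$PLUGIN\""*) echo "enabled" ;;  # plugin is already in the list
        ""|"[]")         echo "merge" ;;    # no plugins field (assumption: [] is equivalent)
        *)               echo "json" ;;     # other plugins exist, so append to the list
    esac
}

# enable_plugin: hypothetical wrapper that runs the matching command from the notes.
enable_plugin() {
    current="$(oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}")"
    case "$(plugin_patch_mode "$current")" in
        enabled)
            echo "$PLUGIN is already enabled"
            ;;
        merge)
            oc patch consoles.operator.openshift.io cluster --type=merge \
                --patch "{ \"spec\": { \"plugins\": [\"$PLUGIN\"] } }"
            ;;
        json)
            oc patch consoles.operator.openshift.io cluster --type=json \
                --patch "[{\"op\": \"add\", \"path\": \"/spec/plugins/-\", \"value\": \"$PLUGIN\" }]"
            ;;
    esac
}
```

After sourcing the script on a host that is logged in to the cluster, running `enable_plugin` performs the check and applies the matching patch. The merge patch is used only when no plugins are listed, because a merge patch replaces the whole `plugins` array rather than appending to it.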

  3. View the deployed resources:

    $ oc -n nvidia-gpu-operator get all -l app.kubernetes.io/name=console-plugin-nvidia-gpu

    Example output

    NAME                                             READY   STATUS    RESTARTS   AGE
    pod/console-plugin-nvidia-gpu-7dc9cfb5df-ztksx   1/1     Running   0          2m6s
    
    NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
    service/console-plugin-nvidia-gpu   ClusterIP   172.30.240.138   <none>        9443/TCP   2m6s
    
    NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/console-plugin-nvidia-gpu   1/1     1            1           2m6s
    
    NAME                                                   DESIRED   CURRENT   READY   AGE
    replicaset.apps/console-plugin-nvidia-gpu-7dc9cfb5df   1         1         1       2m6s
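
For scripted installs, the visual check in step 3 can be approximated by parsing the pod table. The following is a rough sketch under stated assumptions: the helper name `pods_all_ready` is hypothetical, and it expects the default column layout of `oc get pods` (NAME, READY, STATUS, ...) on standard input.

```shell
#!/bin/sh
# Sketch only: succeed when every listed pod is Running with all containers ready.

# pods_all_ready: reads an "oc get pods" table from stdin.
# Exit status 0 if at least one pod is listed and each has READY n/n and STATUS Running.
pods_all_ready() {
    awk '
        NR == 1 { next }   # skip the header row
        NF == 0 { next }   # skip blank lines
        {
            seen = 1
            split($2, ready, "/")   # READY column, for example "1/1"
            if (ready[1] != ready[2] || $3 != "Running") bad = 1
        }
        END {
            if (seen && !bad) exit 0
            exit 1
        }
    '
}
```

A possible invocation is `oc -n nvidia-gpu-operator get pods -l app.kubernetes.io/name=console-plugin-nvidia-gpu | pods_all_ready && echo "plugin pods are ready"`. Where blocking until the rollout finishes is acceptable, `oc -n nvidia-gpu-operator rollout status deployment/console-plugin-nvidia-gpu` is a simpler alternative.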