6.6. 对 NUMA 感知调度进行故障排除

要排除 NUMA 感知 pod 调度的常见问题,请执行以下步骤。

先决条件

  • 安装 OpenShift Container Platform CLI(oc)。
  • 以具有 cluster-admin 权限的用户身份登录。
  • 安装 NUMA Resources Operator 并部署 NUMA 感知辅助调度程序。

流程

  1. 运行以下命令,验证 noderesourcetopologies CRD 是否已在集群中部署:

    $ oc get crd | grep noderesourcetopologies

    输出示例

    NAME                                                              CREATED AT
    noderesourcetopologies.topology.node.k8s.io                       2022-01-18T08:28:06Z

  2. 运行以下命令,检查 NUMA-aware 调度程序名称是否与 NUMA 感知工作负载中指定的名称匹配:

    $ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'

    输出示例

    topo-aware-scheduler

  3. 验证 NUMA-aware scheduable 节点是否应用了 noderesourcetopologies CR。运行以下命令:

    $ oc get noderesourcetopologies.topology.node.k8s.io

    输出示例

    NAME                    AGE
    compute-0.example.com   17h
    compute-1.example.com   17h

    注意

    节点数应该等于机器配置池 (mcp) worker 定义中配置的 worker 节点数量。

  4. 运行以下命令,验证所有 scheduable 节点的 NUMA 区粒度:

    $ oc get noderesourcetopologies.topology.node.k8s.io -o yaml

    输出示例

    apiVersion: v1
    items:
    - apiVersion: topology.node.k8s.io/v1alpha1
      kind: NodeResourceTopology
      metadata:
        annotations:
          k8stopoawareschedwg/rte-update: periodic
        creationTimestamp: "2022-06-16T08:55:38Z"
        generation: 63760
        name: worker-0
        resourceVersion: "8450223"
        uid: 8b77be46-08c0-4074-927b-d49361471590
      topologyPolicies:
      - SingleNUMANodeContainerLevel
      zones:
      - costs:
        - name: node-0
          value: 10
        - name: node-1
          value: 21
        name: node-0
        resources:
        - allocatable: "38"
          available: "38"
          capacity: "40"
          name: cpu
        - allocatable: "134217728"
          available: "134217728"
          capacity: "134217728"
          name: hugepages-2Mi
        - allocatable: "262352048128"
          available: "262352048128"
          capacity: "270107316224"
          name: memory
        - allocatable: "6442450944"
          available: "6442450944"
          capacity: "6442450944"
          name: hugepages-1Gi
        type: Node
      - costs:
        - name: node-0
          value: 21
        - name: node-1
          value: 10
        name: node-1
        resources:
        - allocatable: "268435456"
          available: "268435456"
          capacity: "268435456"
          name: hugepages-2Mi
        - allocatable: "269231067136"
          available: "269231067136"
          capacity: "270573244416"
          name: memory
        - allocatable: "40"
          available: "40"
          capacity: "40"
          name: cpu
        - allocatable: "1073741824"
          available: "1073741824"
          capacity: "1073741824"
          name: hugepages-1Gi
        type: Node
    - apiVersion: topology.node.k8s.io/v1alpha1
      kind: NodeResourceTopology
      metadata:
        annotations:
          k8stopoawareschedwg/rte-update: periodic
        creationTimestamp: "2022-06-16T08:55:37Z"
        generation: 62061
        name: worker-1
        resourceVersion: "8450129"
        uid: e8659390-6f8d-4e67-9a51-1ea34bba1cc3
      topologyPolicies:
      - SingleNUMANodeContainerLevel
      zones: 1
      - costs:
        - name: node-0
          value: 10
        - name: node-1
          value: 21
        name: node-0
        resources: 2
        - allocatable: "38"
          available: "38"
          capacity: "40"
          name: cpu
        - allocatable: "6442450944"
          available: "6442450944"
          capacity: "6442450944"
          name: hugepages-1Gi
        - allocatable: "134217728"
          available: "134217728"
          capacity: "134217728"
          name: hugepages-2Mi
        - allocatable: "262391033856"
          available: "262391033856"
          capacity: "270146301952"
          name: memory
        type: Node
      - costs:
        - name: node-0
          value: 21
        - name: node-1
          value: 10
        name: node-1
        resources:
        - allocatable: "40"
          available: "40"
          capacity: "40"
          name: cpu
        - allocatable: "1073741824"
          available: "1073741824"
          capacity: "1073741824"
          name: hugepages-1Gi
        - allocatable: "268435456"
          available: "268435456"
          capacity: "268435456"
          name: hugepages-2Mi
        - allocatable: "269192085504"
          available: "269192085504"
          capacity: "270534262784"
          name: memory
        type: Node
    kind: List
    metadata:
      resourceVersion: ""
      selfLink: ""

    1
    zones 下的每个小节都描述了单个 NUMA 区域的资源。
    2
    resources 描述了 NUMA 区域资源的当前状态。检查 items.zones.resources.available 下列出的资源是否与分配给每个有保证的 pod 的独有的 NUMA 区资源对应。

6.6.1. 检查 NUMA 感知调度程序日志

通过查看日志来排除 NUMA 感知调度程序的问题。如果需要,可以通过修改 NUMAResourcesScheduler 资源的 spec.logLevel 字段来增加调度程序日志级别。可接受值为 NormalDebugTrace,其中 Trace 是最详细的选项。

注意

要更改辅助调度程序的日志级别,请删除正在运行的调度程序资源,并使用更改后的日志级别重新部署它。在此停机期间,调度程序无法调度新的工作负载。

先决条件

  • 安装 OpenShift CLI(oc)。
  • 以具有 cluster-admin 特权的用户身份登录。

流程

  1. 删除当前运行的 NUMAResourcesScheduler 资源:

    1. 运行以下命令来获取活跃的 NUMAResourcesScheduler

      $ oc get NUMAResourcesScheduler

      输出示例

      NAME                     AGE
      numaresourcesscheduler   90m

    2. 运行以下命令来删除二级调度程序资源:

      $ oc delete NUMAResourcesScheduler numaresourcesscheduler

      输出示例

      numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted

  2. 将以下 YAML 保存到文件 nro-scheduler-debug.yaml 中。本例将日志级别更改为 Debug

    apiVersion: nodetopology.openshift.io/v1alpha1
    kind: NUMAResourcesScheduler
    metadata:
      name: numaresourcesscheduler
    spec:
      imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.12"
      logLevel: Debug
  3. 运行以下命令,创建更新的 Debug logging NUMAResourcesScheduler 资源:

    $ oc create -f nro-scheduler-debug.yaml

    输出示例

    numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created

验证步骤

  1. 检查 NUMA-aware 调度程序是否已成功部署:

    1. 运行以下命令检查 CRD 是否已创建成功:

      $ oc get crd | grep numaresourcesschedulers

      输出示例

      NAME                                                              CREATED AT
      numaresourcesschedulers.nodetopology.openshift.io                 2022-02-25T11:57:03Z

    2. 运行以下命令,检查新的自定义调度程序是否可用:

      $ oc get numaresourcesschedulers.nodetopology.openshift.io

      输出示例

      NAME                     AGE
      numaresourcesscheduler   3h26m

  2. 检查调度程序的日志是否显示增加的日志级别:

    1. 运行以下命令,获取在 openshift-numaresources 命名空间中运行的 pod 列表:

      $ oc get pods -n openshift-numaresources

      输出示例

      NAME                                               READY   STATUS    RESTARTS   AGE
      numaresources-controller-manager-d87d79587-76mrm   1/1     Running   0          46h
      numaresourcesoperator-worker-5wm2k                 2/2     Running   0          45h
      numaresourcesoperator-worker-pb75c                 2/2     Running   0          45h
      secondary-scheduler-7976c4d466-qm4sc               1/1     Running   0          21m

    2. 运行以下命令,获取二级调度程序 pod 的日志:

      $ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources

      输出示例

      ...
      I0223 11:04:55.614788       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 11 items received
      I0223 11:04:56.609114       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicationController total 10 items received
      I0223 11:05:22.626818       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.StorageClass total 7 items received
      I0223 11:05:31.610356       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PodDisruptionBudget total 7 items received
      I0223 11:05:31.713032       1 eventhandlers.go:186] "Add event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
      I0223 11:05:53.461016       1 eventhandlers.go:244] "Delete event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"

6.6.2. 对资源拓扑 exporter 进行故障排除

通过检查对应的 resource-topology-exporter 日志,对发生意外结果的 noderesourcetopologlogies 对象进行故障排除。

注意

建议为它们引用的节点命名 NUMA 资源拓扑导出器实例。例如,名为 worker 的 worker 节点应具有对应的 noderesourcetopologies 对象,称为 worker

先决条件

  • 安装 OpenShift CLI(oc)。
  • 以具有 cluster-admin 特权的用户身份登录。

流程

  1. 获取由 NUMA Resources Operator 管理的守护进程集(daemonset)。每个守护进程在 NUMAResourcesOperator CR 中有一个对应的 nodeGroup。运行以下命令:

    $ oc get numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator -o jsonpath="{.status.daemonsets[0]}"

    输出示例

    {"name":"numaresourcesoperator-worker","namespace":"openshift-numaresources"}

  2. 使用上一步中的 name 值获取所需的守护进程集的标签:

    $ oc get ds -n openshift-numaresources numaresourcesoperator-worker -o jsonpath="{.spec.selector.matchLabels}"

    输出示例

    {"name":"resource-topology"}

  3. 运行以下命令,使用 resource-topology 标签获取 pod:

    $ oc get pods -n openshift-numaresources -l name=resource-topology -o wide

    输出示例

    NAME                                 READY   STATUS    RESTARTS   AGE    IP            NODE
    numaresourcesoperator-worker-5wm2k   2/2     Running   0          2d1h   10.135.0.64   compute-0.example.com
    numaresourcesoperator-worker-pb75c   2/2     Running   0          2d1h   10.132.2.33   compute-1.example.com

  4. 检查与您要故障排除的节点对应的 worker pod 上运行的 resource-topology-exporter 容器的日志。运行以下命令:

    $ oc logs -n openshift-numaresources -c resource-topology-exporter numaresourcesoperator-worker-pb75c

    输出示例

    I0221 13:38:18.334140       1 main.go:206] using sysinfo:
    reservedCpus: 0,1
    reservedMemory:
      "0": 1178599424
    I0221 13:38:18.334370       1 main.go:67] === System information ===
    I0221 13:38:18.334381       1 sysinfo.go:231] cpus: reserved "0-1"
    I0221 13:38:18.334493       1 sysinfo.go:237] cpus: online "0-103"
    I0221 13:38:18.546750       1 main.go:72]
    cpus: allocatable "2-103"
    hugepages-1Gi:
      numa cell 0 -> 6
      numa cell 1 -> 1
    hugepages-2Mi:
      numa cell 0 -> 64
      numa cell 1 -> 128
    memory:
      numa cell 0 -> 45758Mi
      numa cell 1 -> 48372Mi

6.6.3. 更正缺少的资源拓扑 exporter 配置映射

如果您在配置了集群设置的集群中安装 NUMA Resources Operator,在有些情况下,Operator 会显示为 active,但资源拓扑 exporter (RTE) 守护进程集 pod 的日志显示 RTE 的配置缺失,例如:

Info: couldn't find configuration in "/etc/resource-topology-exporter/config.yaml"

此日志消息显示集群中未正确应用带有所需配置的 kubeletconfig,从而导致缺少 RTE configmap。例如,以下集群缺少 numaresourcesoperator-worker configmap 自定义资源 (CR):

$ oc get configmap

输出示例

NAME                           DATA   AGE
0e2a6bd3.openshift-kni.io      0      6d21h
kube-root-ca.crt               1      6d21h
openshift-service-ca.crt       1      6d21h
topo-aware-scheduler-config    1      6d18h

在正确配置的集群中,oc get configmap 也会返回一个 numaresourcesoperator-worker configmap CR。

先决条件

  • 安装 OpenShift Container Platform CLI(oc)。
  • 以具有 cluster-admin 权限的用户身份登录。
  • 安装 NUMA Resources Operator 并部署 NUMA 感知辅助调度程序。

流程

  1. 使用以下命令,比较 kubeletconfig 中的 spec.machineConfigPoolSelector.matchLabels 值和 MachineConfigPool (mcp) worker CR 中的 metadata.labels 的值:

    1. 运行以下命令来检查 kubeletconfig 标签:

      $ oc get kubeletconfig -o yaml

      输出示例

      machineConfigPoolSelector:
        matchLabels:
          cnf-worker-tuning: enabled

    2. 运行以下命令来检查 mcp 标签:

      $ oc get mcp worker -o yaml

      输出示例

      labels:
        machineconfiguration.openshift.io/mco-built-in: ""
        pools.operator.machineconfiguration.openshift.io/worker: ""

      cnf-worker-tuning: enabled 标签没有存在于 MachineConfigPool 对象中。

  2. 编辑 MachineConfigPool CR 使其包含缺少的标签,例如:

    $ oc edit mcp worker -o yaml

    输出示例

    labels:
      machineconfiguration.openshift.io/mco-built-in: ""
      pools.operator.machineconfiguration.openshift.io/worker: ""
      cnf-worker-tuning: enabled

  3. 应用标签更改并等待集群应用更新的配置。运行以下命令:

验证

  • 检查是否应用了缺少的 numaresourcesoperator-worker configmap CR:

    $ oc get configmap

    输出示例

    NAME                           DATA   AGE
    0e2a6bd3.openshift-kni.io      0      6d21h
    kube-root-ca.crt               1      6d21h
    numaresourcesoperator-worker   1      5m
    openshift-service-ca.crt       1      6d21h
    topo-aware-scheduler-config    1      6d18h