Menu Close

Red Hat Training

A Red Hat training course is available for OpenShift Container Platform

26.4. 自定义检测的条件

您可以通过编辑 Node Problem Detector 配置映射,将节点问题检测程序配置为监视任何日志字符串。

节点问题检测器配置映射示例

apiVersion: v1
kind: ConfigMap
metadata:
  name: node-problem-detector
data:
  docker-monitor.json: |  1
    {
        "plugin": "journald", 2
        "pluginConfig": {
                "source": "docker"
        },
        "logPath": "/host/log/journal", 3
        "lookback": "5m",
        "bufferSize": 10,
        "source": "docker-monitor",
        "conditions": [],
        "rules": [              4
                {
                        "type": "temporary", 5
                        "reason": "CorruptDockerImage", 6
                        "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*" 7
                }
        ]
    }
  kernel-monitor.json: |  8
    {
        "plugin": "journald", 9
        "pluginConfig": {
                "source": "kernel"
        },
        "logPath": "/host/log/journal", 10
        "lookback": "5m",
        "bufferSize": 10,
        "source": "kernel-monitor",
        "conditions": [                 11
                {
                        "type": "KernelDeadlock", 12
                        "reason": "KernelHasNoDeadlock", 13
                        "message": "kernel has no deadlock"  14
                }
        ],
        "rules": [
                {
                        "type": "temporary",
                        "reason": "OOMKilling",
                        "pattern": "Kill process \\d+ (.+) score \\d+ or sacrifice child\\nKilled process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB"
                },
                {
                        "type": "temporary",
                        "reason": "TaskHung",
                        "pattern": "task \\S+:\\w+ blocked for more than \\w+ seconds\\."
                },
                {
                        "type": "temporary",
                        "reason": "UnregisterNetDevice",
                        "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
                },
                {
                        "type": "temporary",
                        "reason": "KernelOops",
                        "pattern": "BUG: unable to handle kernel NULL pointer dereference at .*"
                },
                {
                        "type": "temporary",
                        "reason": "KernelOops",
                        "pattern": "divide error: 0000 \\[#\\d+\\] SMP"
                },
                {
                        "type": "permanent",
                        "condition": "KernelDeadlock",
                        "reason": "AUFSUmountHung",
                        "pattern": "task umount\\.aufs:\\w+ blocked for more than \\w+ seconds\\."
                },
                {
                        "type": "permanent",
                        "condition": "KernelDeadlock",
                        "reason": "DockerHung",
                        "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
                }
        ]
    }

1
适用于容器镜像的规则和条件。
2 9
监控服务,以逗号分隔列表中。
3 10
监控服务日志的路径。
4 11
要监控的事件列表。
5 12
用于指示错误的标签(临时)或 NodeCondition(永久)。
6 13
描述错误的文本消息。
7 14
节点问题检测程序监视的错误消息。
8
适用于内核的规则和条件。

要配置节点问题检测程序,请添加或移除问题条件和事件。

  1. 使用文本编辑器编辑节点问题检测程序配置映射。

    $ oc edit configmap -n openshift-node-problem-detector node-problem-detector
  2. 根据需要删除、添加或编辑任何节点条件或事件。

    {
           "type": <`temporary` or `permanent`>,
           "reason": <free-form text describing the error>,
           "pattern": <log message to watch for>
    },

    例如:

    {
           "type": "temporary",
           "reason": "UnregisterNetDevice",
           "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
    },
  3. 重启正在运行的容器集以应用更改。要重启 pod,您可以删除所有现有的 pod:

    # oc delete pods -n openshift-node-problem-detector -l name=node-problem-detector
  4. 要将节点问题检测器输出显示到标准输出(stdout)和标准错误(stderr),在 DaemonSet 中为节点问题检测器添加以下内容:

    spec:
      template:
        spec:
          containers:
          - name: node-problem-detector
            command:
            - node-problem-detector
            - --alsologtostderr=true 1
            - --log_dir="/tmp" 2
            - --system-log-monitors=/etc/npd/kernel-monitor.json,/etc/npd/docker-monitor.json 3
    1
    将输出发送到标准输出(stdout)。
    2
    错误日志的路径。
    3
    插件配置文件的逗号分隔路径。