Troubleshooting Collector

Red Hat Advanced Cluster Security for Kubernetes 3.73

Red Hat OpenShift Documentation Team

Abstract

Use this guide to learn how to retrieve logs from Collector and debug issues.

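Chapter 1. Retrieving and analyzing Collector logs and pod status

The first step in troubleshooting is to retrieve the logs and the pod status. The logs can help you identify the root cause of an error, and the last state of the pod provides additional information about the failure.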

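1.1. Retrieving Collector logs

Start by checking the logs of the failing Collector. Depending on your environment and access rights, you can get these logs in two ways: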

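1.1.1. Retrieving logs by using the oc or kubectl command

You can use the oc or kubectl command to get logs from a running Collector pod. If the current Collector pod has restarted, you can also check the logs of the previous Collector pod.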

Prerequisites

  • Ensure that you have permission to list pods and read their logs:

    $ oc auth can-i get pods && oc auth can-i get pods --subresource=log 1
    1
    If you use Kubernetes, enter kubectl instead of oc.

Procedure

  1. List all pods with the label app=collector:

    $ oc get pods -n stackrox -l app=collector 1
    1
    If you use Kubernetes, enter kubectl instead of oc.

    Example output

    collector-vclg5    1/2     CrashLoopBackOff   2 (25s ago)   2m41s

  2. Get the logs of the Collector pod:

    $ oc logs -n stackrox <collector_pod_name> collector 1
    1
    If you use Kubernetes, enter kubectl instead of oc. For <collector_pod_name>, specify the name of the Collector pod, for example, collector-vclg5.

  3. (Optional) If the current Collector pod has restarted, check the logs of the previous Collector pod:

    $ oc logs -n stackrox <collector_pod_name> collector --previous 1
    1
    If you use Kubernetes, enter kubectl instead of oc. For <collector_pod_name>, specify the name of the Collector pod, for example, collector-vclg5.
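
The procedure above can be sketched as a small shell function that saves both the current and the previous logs of the collector container to files. This is a minimal sketch and not part of RHACS: the function name collect_collector_logs, the CLI variable, and the output file names are assumptions made for illustration. Set CLI=kubectl if you use Kubernetes.

```shell
# Command-line client to use; defaults to oc (hypothetical convenience variable).
CLI="${CLI:-oc}"

collect_collector_logs() {
  # $1 = Collector pod name, $2 = output directory (defaults to the current directory)
  # Logs of the currently running collector container.
  "$CLI" logs -n stackrox "$1" collector > "${2:-.}/$1.log"
  # Logs of the previous container instance; the command fails if there is none.
  if ! "$CLI" logs -n stackrox "$1" collector --previous > "${2:-.}/$1.previous.log" 2>/dev/null; then
    rm -f "${2:-.}/$1.previous.log"
    echo "No previous container logs for $1" >&2
  fi
}

# Usage (needs access to the cluster):
#   collect_collector_logs collector-vclg5 /tmp
```

Keeping the previous logs in a separate file makes it easier to compare the run that crashed with the run that is currently starting.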

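1.1.2. Retrieving logs from the RHACS diagnostic bundle

You can also access the Collector logs by downloading a diagnostic bundle from the Red Hat Advanced Cluster Security for Kubernetes (RHACS) user interface. After you download the diagnostic bundle, you can check the logs of all Collector pods. For more information, see Generating a diagnostic bundle.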

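1.2. Analyzing the Collector pod status

Checking the last state of the pod is another simple way to determine why Collector crashed. Failure information is recorded in the last state, and you can view it by using the oc describe pod or kubectl describe pod command.

Procedure

  • Describe the Collector pod: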

    $ oc describe pod -n stackrox <collector_pod_name> 1
    1
    If you use Kubernetes, enter kubectl instead of oc. For <collector_pod_name>, specify the name of the Collector pod, for example, collector-vclg5.

    Example output

    [...]
        Last State:     Terminated
          Reason:       Error
          Message:      No suitable kernel object downloaded 1
          Exit Code:    1
          Started:      Fri, 21 Oct 2022 11:50:56 +0100
          Finished:     Fri, 21 Oct 2022 11:51:25 +0100
    [...]

    1
    In this example, you can see that Collector could not download the kernel driver.
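
If you only need the termination message rather than the full describe output, a JSONPath query can read it from the pod status. The following is a minimal sketch, assuming that the container is named collector (as in the logs commands in this guide) and that the message is stored in the standard lastState.terminated.message field of the container status. The function name collector_last_message and the CLI variable are hypothetical.

```shell
# Command-line client to use; defaults to oc (hypothetical convenience variable).
CLI="${CLI:-oc}"

collector_last_message() {
  # $1 = Collector pod name
  # Prints the message of the last terminated state of the collector container.
  "$CLI" get pod -n stackrox "$1" \
    -o jsonpath='{.status.containerStatuses[?(@.name=="collector")].lastState.terminated.message}'
}

# Usage (needs access to the cluster):
#   collector_last_message collector-vclg5
```

The output is empty if the container has not terminated before.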

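Chapter 2. Common error conditions

Most errors occur while Collector configures itself and downloads the kernel driver for the system.

The following diagram shows the main parts of the Collector startup process:

Figure 2.1. Collector pod startup process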

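If any part of the startup process fails, the logs display a diagnostic summary that details which steps succeeded and which steps failed.

The following log file example shows a successful startup: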

[INFO    2022/11/28 13:21:55] == Collector Startup Diagnostics: ==
[INFO    2022/11/28 13:21:55]  Connected to Sensor?       true
[INFO    2022/11/28 13:21:55]  Kernel driver available?   true
[INFO    2022/11/28 13:21:55]  Driver loaded into kernel? true
[INFO    2022/11/28 13:21:55] ====================================

The log output confirms that Collector connected to Sensor and loaded the kernel driver. You can use this log to verify that Collector started successfully.
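
The three checks in the diagnostics block can also be read by a script. The following is a minimal sketch that reads a Collector log on standard input and prints the first check that is not reported as true. It assumes the log format shown in the example above, and the function name first_failed_step and its output strings are made up for illustration.

```shell
first_failed_step() {
  # Remember the last value reported for each check, then report the first failure.
  awk '
    /Connected to Sensor[?]/       { sensor = $NF }
    /Kernel driver available[?]/   { driver = $NF }
    /Driver loaded into kernel[?]/ { loaded = $NF }
    END {
      if (sensor == "")          print "no startup diagnostics found"
      else if (sensor != "true") print "unable to connect to Sensor"
      else if (driver != "true") print "kernel driver unavailable"
      else if (loaded != "true") print "failed to load kernel driver"
      else                       print "startup successful"
    }'
}

# Usage (needs access to the cluster):
#   oc logs -n stackrox collector-vclg5 collector | first_failed_step
```

If the log contains several diagnostics blocks, the sketch reports on the last one.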

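2.1. Unable to connect to Sensor

At startup, Collector first checks whether it can connect to Sensor. Sensor is responsible for downloading the kernel drivers and the CIDR blocks for processing network events, which makes it an essential part of the startup process. The following logs indicate that Collector cannot connect to Sensor: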

Collector Version: 3.12.0
OS: Ubuntu 20.04.4 LTS
Kernel Version: 5.4.0-126-generic
Starting StackRox Collector...
[INFO    2022/10/13 12:20:43] Hostname: 'hostname'
[...]
[INFO    2022/10/13 12:20:43] Sensor configured at address: sensor.stackrox.svc:9998
[INFO    2022/10/13 12:20:43] Attempting to connect to Sensor
[INFO    2022/10/13 12:21:13]
[INFO    2022/10/13 12:21:13] == Collector Startup Diagnostics: ==
[INFO    2022/10/13 12:21:13]  Connected to Sensor?       false
[INFO    2022/10/13 12:21:13]  Kernel driver available?   false
[INFO    2022/10/13 12:21:13]  Driver loaded into kernel? false
[INFO    2022/10/13 12:21:13] ====================================
[INFO    2022/10/13 12:21:13]
[FATAL   2022/10/13 12:21:13] Unable to connect to Sensor.

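This error can mean that Sensor has not started correctly or that Collector is configured incorrectly. To fix this issue, verify the Collector configuration to ensure that the Sensor address is correct and that the Sensor pod is running correctly.

Review the Collector logs to check the configured Sensor address. Alternatively, you can run the following command: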

$ kubectl -n stackrox get pod <collector_pod_name> -o jsonpath='{.spec.containers[0].env[?(@.name=="GRPC_SERVER")].value}' 1
1
For <collector_pod_name>, specify the name of the Collector pod, for example, collector-vclg5.
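
Both checks described above, the Sensor address and the state of the Sensor pod, can be sketched as shell functions. This is only an illustration: the function names and the CLI variable are hypothetical, and the label app=sensor used to select the Sensor pod is an assumption that you might need to adjust for your deployment.

```shell
# Command-line client to use; defaults to oc (hypothetical convenience variable).
CLI="${CLI:-oc}"

# Reads a Collector log on stdin and prints the Sensor address that it reports.
sensor_address_from_log() {
  sed -n 's/.*Sensor configured at address: //p' | tail -n 1
}

# Lists the Sensor pods so that you can confirm that they are running.
# The label app=sensor is an assumption; adjust it to match your deployment.
sensor_pods() {
  "$CLI" get pods -n stackrox -l app=sensor
}

# Usage (needs access to the cluster):
#   oc logs -n stackrox collector-vclg5 collector | sensor_address_from_log
#   sensor_pods
```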

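2.2. Kernel driver unavailable

Collector determines whether it has a kernel driver for the kernel version of the node. Collector first searches local storage for a driver with the correct version and type, and then attempts to download the driver from Sensor. The following logs indicate that neither a local kernel driver nor a driver from Sensor is available: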

Collector Version: 3.12.0
OS: Alpine Linux v3.14
Kernel Version: 5.10.109-0-virt
Starting StackRox Collector...
[INFO    2022/10/13 13:32:57] Hostname: 'alpine'
[...]
[INFO    2022/10/13 13:32:57] Sensor configured at address: sensor.stackrox.svc:9999
[INFO    2022/10/13 13:32:57] Attempting to connect to Sensor
[INFO    2022/10/13 13:32:57] Successfully connected to Sensor.
[INFO    2022/10/13 13:32:57] Module version: 2.2.0
[INFO    2022/10/13 13:32:57] Attempting to find kernel module - Candidate kernel versions:
[INFO    2022/10/13 13:32:57] 5.10.109-0-virt
[INFO    2022/10/13 13:32:57] Local storage does not contain collector-5.10.109-0-virt.ko
[...]
[INFO    2022/10/13 13:32:57] Attempting to download kernel object from https://sensor.stackrox.svc/kernel-objects/2.2.0/collector-5.10.109-0-virt.ko.gz 1
[WARNING 2022/10/13 13:32:58] [Throttled] Unexpected HTTP request failure (HTTP 404) 2
[WARNING 2022/10/13 13:33:08] [Throttled] Unexpected HTTP request failure (HTTP 404)
[WARNING 2022/10/13 13:33:18] [Throttled] Unexpected HTTP request failure (HTTP 404)
[WARNING 2022/10/13 13:33:29] [Throttled] Unexpected HTTP request failure (HTTP 404)
[WARNING 2022/10/13 13:33:35] Attempted to download collector-5.10.109-0-virt.ko.gz 30 time(s)
[WARNING 2022/10/13 13:33:35] Failed to download from collector-5.10.109-0-virt.ko.gz
[WARNING 2022/10/13 13:33:35] Unable to download kernel object collector-5.10.109-0-virt.ko to /module/collector.ko.gz
[ERROR   2022/10/13 13:33:35] Error getting kernel object: collector-5.10.109-0-virt.ko
[INFO    2022/10/13 13:33:35]
[INFO    2022/10/13 13:33:35] == Collector Startup Diagnostics: ==
[INFO    2022/10/13 13:33:35]  Connected to Sensor?       true
[INFO    2022/10/13 13:33:35]  Kernel driver available?   false
[INFO    2022/10/13 13:33:35]  Driver loaded into kernel? false
[INFO    2022/10/13 13:33:35] ====================================
[INFO    2022/10/13 13:33:35]
[FATAL   2022/10/13 13:33:35]  No suitable kernel object downloaded for kernel 5.10.109-0-virt 3
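1
The logs show that Collector first looks for the module locally and then attempts to download the driver from Sensor.
2
The 404 errors indicate that there is no kernel driver for the kernel of the node.
3
Because the driver is missing, Collector enters the CrashLoopBackOff state.

The Kernel versions file contains a list of all supported kernel versions.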
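
To compare the failing kernel with the list of supported versions, you can extract the kernel version from the final FATAL line of the log and search for it in a local copy of the Kernel versions file. The following is a minimal sketch. It assumes that the local copy contains one kernel version per line, which might not match the real layout of the file, and the function names are hypothetical.

```shell
# Reads a Collector log on stdin and prints the kernel version from the
# "No suitable kernel object downloaded" message.
missing_kernel_from_log() {
  sed -n 's/.*No suitable kernel object downloaded for kernel \([^ ]*\).*/\1/p' | tail -n 1
}

# $1 = kernel version, $2 = local copy of the Kernel versions file
# (assumed format: one version per line). Succeeds if the version is listed.
kernel_is_listed() {
  grep -Fxq -- "$1" "$2"
}

# Usage (needs access to the cluster; kernel-versions.txt is a hypothetical file name):
#   kernel=$(oc logs -n stackrox collector-vclg5 collector | missing_kernel_from_log)
#   kernel_is_listed "$kernel" kernel-versions.txt && echo "supported" || echo "not listed"
```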

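2.3. Failing to load the kernel driver

Before Collector starts, it loads the kernel driver. However, in rare cases, Collector cannot load the kernel driver, which results in various error messages or exceptions. In such cases, you must check the logs to identify what failed while loading the kernel driver.

Consider the following Collector logs: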

[INFO    2022/10/13 14:25:13] Hostname: 'hostname'
[...]
[INFO    2022/10/13 14:25:13] Successfully downloaded and decompressed /module/collector.ko
[INFO    2022/10/13 14:25:13]
[INFO    2022/10/13 14:25:13] This product uses kernel module and ebpf subcomponents licensed under the GNU
[INFO    2022/10/13 14:25:13] GENERAL PURPOSE LICENSE Version 2 outlined in the /kernel-modules/LICENSE file.
[INFO    2022/10/13 14:25:13] Source code for the kernel module and ebpf subcomponents is available upon
[INFO    2022/10/13 14:25:13] request by contacting support@stackrox.com.
[INFO    2022/10/13 14:25:13]
[...]
[INFO    2022/10/13 14:25:13] Inserting kernel module /module/collector.ko with indefinite removal and retry if required.
[ERROR   2022/10/13 14:25:13] Error inserting kernel module: /module/collector.ko: Operation not permitted. Aborting...
[ERROR   2022/10/13 14:25:13] Failed to insert kernel module
[ERROR   2022/10/13 14:25:13] Failed to setup Kernel module
[INFO    2022/10/13 14:25:13]
[INFO    2022/10/13 14:25:13] == Collector Startup Diagnostics: ==
[INFO    2022/10/13 14:25:13]  Connected to Sensor?       true
[INFO    2022/10/13 14:25:13]  Kernel driver available?   true
[INFO    2022/10/13 14:25:13]  Driver loaded into kernel? false
[INFO    2022/10/13 14:25:13] ====================================
[INFO    2022/10/13 14:25:13]
[FATAL   2022/10/13 14:25:13] Failed to initialize collector kernel components.

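If you encounter this kind of error, it is unlikely that you can fix it yourself. Instead, report it to the Red Hat Advanced Cluster Security for Kubernetes (RHACS) support team or create a GitHub issue.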
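
When you prepare such a report, it can help to pull out only the ERROR and FATAL lines of the log. The following is a minimal sketch that assumes the log prefix format shown in the example above. The function name error_lines is hypothetical.

```shell
# Reads a Collector log on stdin and prints only the ERROR and FATAL lines.
error_lines() {
  grep -E '^\[(ERROR|FATAL) '
}

# Usage (needs access to the cluster):
#   oc logs -n stackrox collector-vclg5 collector | error_lines
```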

Legal Notice

Copyright © 2023 Red Hat, Inc.
The text of and illustrations in this document are licensed by Red Hat under a Creative Commons Attribution–Share Alike 3.0 Unported license ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/3.0/. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, Red Hat Enterprise Linux, the Shadowman logo, the Red Hat logo, JBoss, OpenShift, Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
Java® is a registered trademark of Oracle and/or its affiliates.
XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries.
MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries.
Node.js® is an official trademark of Joyent. Red Hat is not formally related to or endorsed by the official Joyent Node.js open source or commercial project.
The OpenStack® Word Mark and OpenStack logo are either registered trademarks/service marks or trademarks/service marks of the OpenStack Foundation, in the United States and other countries and are used with the OpenStack Foundation's permission. We are not affiliated with, endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.
All other trademarks are the property of their respective owners.