Is there any way to reduce the size of must-gathers from OpenShift 4?

Solution Verified - Updated -

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • oc CLI

Issue

  • In some clusters, usually those with a large number of nodes, must-gathers can become very large (close to 100 GiB in some cases). Creating, uploading, and uncompressing them can then take a very long time, which significantly affects Red Hat's capacity to provide support promptly.

Resolution

A request for enhancement RFE-4568 was submitted concerning this topic and it has been accepted.

As a result of that RFE, the OpenShift 4.16 oc binary and must-gather images introduced, as a Technology Preview feature, the ability to filter the logs collected by must-gather using the --since and --since-time options of the must-gather sub-command.
The feature is GA starting with OpenShift 4.17, as per the release notes: new flags added for must-gather command.

Using the new options to filter the logs

Example usage of the new options in OpenShift 4.16 (as a Technology Preview feature) and in OpenShift 4.17 and newer releases (as GA):

$ oc adm must-gather --since=24h
$ oc adm must-gather --since-time=$(date -d '-24 hours' +%Y-%m-%dT%T.%9N%:z )

Note: to use those options, both the cluster and the oc binary need to be 4.16 or newer.
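
The version requirement can be checked before running must-gather. The following is a minimal sketch, not part of oc: supports_since is a hypothetical helper that takes a version string such as 4.16.3 and succeeds only for 4.16 or newer.

```shell
# Hypothetical helper (not provided by oc): succeeds if the given
# version is 4.16 or newer, the minimum for --since / --since-time.
supports_since() {
  # Split a version such as 4.16.3 into its major and minor parts.
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%.*}
  [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 16 ]; }
}

# Sample version strings, chosen for illustration only.
for v in 4.14.35 4.16.0 4.17.2; do
  if supports_since "$v"; then
    echo "$v: --since and --since-time are available"
  else
    echo "$v: use the workaround for 4.15 and older"
  fi
done
```

In practice the version strings would come from the Client Version and Server Version lines printed by oc version, and both need to pass the check.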

Workaround for 4.15 and older releases

For Red Hat OpenShift Container Platform versions 4.15 and earlier, the following command can be used instead:

$ oc adm must-gather -- "sed -i 's#oc adm inspect#oc adm inspect --since=24h#g' /usr/bin/*gather* ; /usr/bin/gather"

This command changes how oc adm inspect runs inside must-gather. It uses sed to find every occurrence of oc adm inspect in the gather* scripts under /usr/bin/ (used internally by oc adm must-gather), replaces each one with oc adm inspect --since=24h, and then runs /usr/bin/gather. The --since=24h option restricts the inspection to data from the previous 24 hours; replace 24h with the desired duration. The resulting must-gather contains only recent data, which is useful when investigating recent issues or events in the cluster.
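
The effect of the sed expression can be previewed locally on a stand-in file. This is only an illustration: the file name gather_example and its single line are made up, and GNU sed is assumed for the -i option.

```shell
# Create a throwaway directory with a stand-in for a gather script.
# The file name and its content are invented for this illustration.
workdir=$(mktemp -d)
cat > "$workdir/gather_example" <<'EOF'
oc adm inspect --dest-dir=must-gather ns/openshift-example
EOF

# Same substitution as in the workaround, applied to the stand-in file.
sed -i 's#oc adm inspect#oc adm inspect --since=24h#g' "$workdir"/*gather*

# Show the rewritten line, then clean up.
result=$(cat "$workdir/gather_example")
echo "$result"
rm -rf "$workdir"
```

The printed line shows --since=24h inserted directly after oc adm inspect, with the rest of the command unchanged.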

Root Cause

In some clusters, the main reason for this problem is the rotated logs of the pods hosted in some OpenShift projects like openshift-sdn. However, the reason why a must-gather becomes too large does not have to be the same in all cases. If other root causes are found, feel free to report them to Red Hat Support.
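
To find out what takes up the space in a particular must-gather, the extracted directory can be inspected with standard tools. The sketch below is an assumption about how one might do this and is not a Red Hat procedure: largest_dirs is a hypothetical helper, and the directory passed to it is whatever path the must-gather was extracted to.

```shell
# Hypothetical helper: list the N largest entries (default 10)
# directly under the given directory, with sizes in KiB.
largest_dirs() {
  du -sk "$1"/* | sort -rn | head -n "${2:-10}"
}

# Demonstration on a throwaway directory with made-up content.
demo=$(mktemp -d)
mkdir "$demo/small" "$demo/large"
head -c 1048576 /dev/urandom > "$demo/large/rotated.log"
echo "short" > "$demo/small/current.log"

largest_dirs "$demo" 2
```

Running the helper at each level of the extracted must-gather narrows down which namespaces or pods contribute most of the data.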

Diagnostic Steps

The new flags are only available starting with OpenShift 4.16 as a Technology Preview feature, and GA in OpenShift 4.17:

$ oc version
Client Version: 4.14.35
Server Version: 4.14.35

$ oc adm must-gather --since=24h
error: unknown flag: --since
See 'oc adm must-gather --help' for usage.
[...]

