Is there any way to reduce the size of must-gathers from OpenShift 4?
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4
- oc CLI
Issue
- In some clusters, usually those with a large number of nodes, must-gathers can become very large (close to 100 GiB in some cases). Creating, uploading, and uncompressing them can then take a very long time, which significantly affects Red Hat's capacity to provide support promptly.
Resolution
A request for enhancement, RFE-4568, was submitted on this topic and has been accepted.
As a result of that RFE, starting with the OpenShift 4.16 oc binary and must-gather images, the ability to filter the logs collected by must-gather was introduced as a Technology Preview feature, through the --since and --since-time options of the must-gather sub-command.
That feature is GA starting with OpenShift 4.17, as per the release notes: new flags added for must-gather command.
Using the new options to filter the logs
Example usage of the new options, available in OpenShift 4.16 as a Technology Preview feature and in OpenShift 4.17 and newer releases as GA:
$ oc adm must-gather --since=24h
$ oc adm must-gather --since-time=$(date -d '-24 hours' +%Y-%m-%dT%T.%9N%:z )
Note: to use these options, both the cluster and the oc binary need to be 4.16 or newer.
Workaround for 4.15 and older releases
For Red Hat OpenShift Container Platform versions 4.15 and earlier, the following command is applicable:
$ oc adm must-gather -- "sed -i 's#oc adm inspect#oc adm inspect --since=24h#g' /usr/bin/*gather* ; /usr/bin/gather"
It alters how the oc adm inspect command operates within the must-gather procedure. More precisely, it uses the sed command to locate occurrences of oc adm inspect in the different gather* scripts in /usr/bin/ (used internally by oc adm must-gather) and replaces them with oc adm inspect --since=24h. The added --since=24h restricts the inspection to data from the previous 24 hours (replace 24h with the desired time window). The adjusted command therefore collects only the diagnostic data from the preceding 24 hours, which is useful when investigating recent issues or events in the cluster.
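To see what the sed substitution does before running it inside the must-gather pod, it can be tried locally on a sample line. The following is only a sketch: the sample line is an illustrative stand-in, not copied from the real gather scripts, and the window variable is a hypothetical name used here to change the time window.

```shell
# Illustrative stand-in for one line of a gather script (assumption, not
# taken from the actual scripts shipped in the must-gather image).
sample='oc adm inspect --dest-dir must-gather clusteroperators'

# Desired time window; the workaround command above uses 24h.
window='48h'

# Same substitution as in the workaround, applied to the sample line only.
rewritten=$(printf '%s\n' "$sample" | sed "s#oc adm inspect#oc adm inspect --since=${window}#g")
echo "$rewritten"
# prints: oc adm inspect --since=48h --dest-dir must-gather clusteroperators
```

Every matching oc adm inspect call gains the --since option, and the rest of the line is left untouched.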
Root Cause
In some clusters, the main cause of this problem is the rotated logs of the pods hosted in some OpenShift projects, such as openshift-sdn. However, the reason a must-gather becomes too large is not necessarily the same in every case. If other root causes are found, feel free to report them to Red Hat Support.
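One possible way to check which part of an already extracted must-gather takes the most space is to rank its directories by size. This is a minimal sketch using standard du, sort, and head; largest_dirs is a hypothetical helper name, and the directory in the example call is a placeholder for wherever the must-gather was extracted.

```shell
# Hypothetical helper (not part of oc): print the N largest directories
# below a path, biggest first, with cumulative sizes in KiB.
largest_dirs() {
    du -k "$1" | sort -rn | head -n "${2:-10}"
}

# Example call; replace the placeholder path with the extracted
# must-gather directory.
# largest_dirs ./my-must-gather 15
```

The first line of the output is the top-level directory itself, and the following lines show which subdirectories account for most of the size.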
Diagnostic Steps
The new flags are only available starting with OpenShift 4.16 as a Technology Preview feature, and GA in OpenShift 4.17:
$ oc version
Client Version: 4.14.35
Server Version: 4.14.35
$ oc adm must-gather --since=24h
error: unknown flag: --since
See 'oc adm must-gather --help' for usage.
[...]
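The version check can also be scripted. The sketch below assumes a plain x.y.z version string, like the ones in the oc version output above; supports_since is a hypothetical helper name, not an oc feature.

```shell
# Hypothetical helper (not part of oc): succeeds when a version string
# such as "4.14.35" is 4.16 or newer. Assumes a numeric x.y.z format.
supports_since() {
    major="${1%%.*}"
    rest="${1#*.}"
    minor="${rest%%.*}"
    [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 16 ]; }
}

# Example with the client version from the output above.
if supports_since "4.14.35"; then
    echo "--since is available"
else
    echo "--since is not available, use the workaround"
fi
# prints: --since is not available, use the workaround
```

The client version can be read from the "Client Version:" line of oc version, and the same check should be applied to the "Server Version:" line, since both need to be 4.16 or newer.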
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.