Installing and executing collectl in RHOCP 4
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
Issue
- How to install and run collectl in RHOCP 4?
Resolution
Disclaimer: The content noted herein is provided for convenience only and is not a supported solution by Red Hat. As such, Red Hat is not responsible for any issues incurred as a result of enacting the provided steps and cannot assist in troubleshooting any issues which may arise in following the steps provided.
Install
- Label all the nodes you want to monitor (here, for example, all the nodes except the masters) as collectl=true:

    $ oc get node -o name -l node-role.kubernetes.io/master!= | xargs -I {} oc label {} collectl=true

- Create the Namespace/collectl, ClusterRoleBinding/collectl-privileged, and DaemonSet/collectl resources:

    $ oc apply -k https://github.com/gmeghnag/ocp-collectl

- Switch to the collectl project:

    $ oc project collectl

- Confirm if collectl is running on the desired node/s:

    $ oc get pods
    NAME             READY   STATUS    RESTARTS   AGE
    collectl-clt4s   1/1     Running   0          34s
    collectl-tltzk   1/1     Running   0          34s
    collectl-zqr5r   1/1     Running   0          34s

- At this point, collectl should be successfully running on the desired node/s. After a few minutes, collectl logs should begin showing up in /var/log/collectl/ on the node:

    $ ls /var/log/collectl/
    worker-0-20221004-133156.raw.gz  worker-0-collectl-202210.log
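
As an additional check on the steps above, the number of labeled nodes can be compared with the number of running collectl pods. This is a minimal sketch and not part of the original procedure: the function name check_collectl_coverage is hypothetical, and it assumes the DaemonSet places exactly one pod on each node labeled collectl=true in the collectl namespace.

```shell
# Hypothetical helper: compare labeled nodes with running collectl pods.
# Assumes one collectl pod per node labeled collectl=true (an assumption
# about the DaemonSet, not something stated in this article).
check_collectl_coverage() {
  nodes=$(oc get node -l collectl=true -o name | wc -l | tr -d ' ')
  pods=$(oc get pods -n collectl --field-selector=status.phase=Running -o name | wc -l | tr -d ' ')
  if [ "$nodes" -eq "$pods" ]; then
    echo "OK: $nodes labeled nodes, $pods running collectl pods"
  else
    echo "MISMATCH: $nodes labeled nodes, $pods running collectl pods"
    return 1
  fi
}
```

Calling check_collectl_coverage shortly after the DaemonSet is created may report a mismatch while pods are still starting, so it is worth repeating after a short wait.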
Collect and extract collectl raw log files
- Create the directory collectl_out and collect the compressed collectl raw log files into it:

    $ mkdir -p collectl_out
    $ oc get node -l collectl=true -o json | jq -r '.items[].metadata.name' | while read NODE; do
        oc debug node/${NODE} -q --to-namespace=openshift-etcd -- chroot host sh -c 'cd /var/log/collectl; ls *.raw.gz' | while read FILE; do
          oc debug node/${NODE} -q --to-namespace=openshift-etcd -- chroot host sh -c "cd /var/log/collectl; cat $FILE" > collectl_out/${FILE}
        done
      done

- Extract the .raw log files:

    $ for GZ in collectl_out/*.raw.gz; do zcat "${GZ}" > "${GZ%.gz}"; done
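
Because the archives above are copied by redirecting the output of oc debug, it can be useful to confirm that each one is a complete gzip file before extracting it. The sketch below is an optional addition, not part of the original procedure. It uses gzip's built-in integrity test (gzip -t), and the function name verify_archives is hypothetical.

```shell
# Hypothetical helper: report archives in a directory that fail gzip's
# built-in integrity test, so truncated downloads are caught before extraction.
verify_archives() {
  dir=${1:-collectl_out}
  bad=0
  for gz in "$dir"/*.raw.gz; do
    [ -e "$gz" ] || continue          # no matching files: nothing to check
    if ! gzip -t "$gz" 2>/dev/null; then
      echo "corrupt: $gz"
      bad=1
    fi
  done
  return "$bad"
}
```

Running verify_archives collectl_out prints one line per damaged archive and returns a non-zero status if any were found. In that case the affected file can be collected again.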
Analyze the data
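Start a shell in a container from the quay.io/gmeghnag/collectl image, with the collectl_out directory mounted at /var/log/collectl:

    $ podman run --platform=linux/amd64 --rm -ti -v ${PWD}/collectl_out:/var/log/collectl quay.io/gmeghnag/collectl:4.3.20-ubi9 sh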
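
Inside that container shell, the extracted files can be replayed with collectl's playback mode (-p). The following is a minimal sketch under stated assumptions: the function name playback_all is hypothetical, and the subsystem selection (-scdmn for CPU, disk, memory, and network) and the timestamp option (-oT) are illustrative choices rather than options prescribed by this article.

```shell
# Hypothetical helper: play back every extracted raw file in a directory.
# -p selects playback mode; -scdmn and -oT are illustrative choices
# (CPU, disk, memory, network; prefix each sample with the time).
playback_all() {
  dir=${1:-/var/log/collectl}
  for raw in "$dir"/*.raw; do
    [ -e "$raw" ] || continue         # skip if no .raw files are present
    echo "== $raw"
    collectl -p "$raw" -scdmn -oT
  done
}
```

Calling playback_all with no argument reads from /var/log/collectl, the mount point used in the podman command above.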
Cleanup
Once all diagnostics are complete:
- Delete the Namespace/collectl, ClusterRoleBinding/collectl-privileged, and DaemonSet/collectl resources:

    $ oc delete -k https://github.com/gmeghnag/ocp-collectl.git

- Remove the collectl logs from the node/s:

    $ oc get node -o name -l collectl=true | xargs -I {} oc debug {} -q --to-namespace=openshift-etcd -- chroot host sh -c 'rm -rf /var/log/collectl'

- Remove the collectl=true label from the node/s:

    $ oc get node -o name -l collectl=true | xargs -I {} oc label {} collectl-
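
After the cleanup steps above, a quick check can confirm that nothing was left behind. This sketch is an optional addition, and the function name cleanup_leftovers is hypothetical. It lists any node still labeled collectl=true, along with the collectl namespace and the collectl-privileged cluster role binding if they still exist.

```shell
# Hypothetical helper: list anything the cleanup steps left behind.
# Empty output means no labeled nodes, namespace, or role binding remain.
cleanup_leftovers() {
  oc get node -l collectl=true -o name
  oc get namespace collectl --ignore-not-found -o name
  oc get clusterrolebinding collectl-privileged --ignore-not-found -o name
}
```

The namespace may still be listed for a short time while it terminates, so a non-empty result immediately after deletion does not necessarily indicate a problem.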
Configuration Changes
To modify the existing collectl deployment with custom configuration parameters, see this solution: https://access.redhat.com/solutions/7095759
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.