Installing and executing collectl in RHOCP 4

Solution Verified - Updated -

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4

Issue

  • How to install and run collectl in RHOCP 4?

Resolution

Disclaimer: The content noted herein is provided for convenience only and is not a solution supported by Red Hat. As such, Red Hat is not responsible for any issues incurred as a result of following the provided steps and cannot assist in troubleshooting any issues that may arise from following them.

Install

  1. Label every node you want to monitor with collectl=true. The example below labels all nodes except the masters:

    $ oc get node -o name -l 'node-role.kubernetes.io/master!=' | xargs -I {} oc label {} collectl=true
    
  2. Create the Namespace/collectl, ClusterRoleBinding/collectl-privileged, and DaemonSet/collectl resources:

    oc apply -k https://github.com/gmeghnag/ocp-collectl
    
  3. Switch to the collectl project:

    oc project collectl
    
  4. Confirm that collectl is running on the desired nodes:

    $ oc get pods
    NAME             READY   STATUS    RESTARTS   AGE
    collectl-clt4s   1/1     Running   0          34s
    collectl-tltzk   1/1     Running   0          34s
    collectl-zqr5r   1/1     Running   0          34s 
    
  5. At this point, collectl should be running on the desired nodes. After a few minutes, collectl logs should begin to appear in /var/log/collectl/ on each labeled node:

    $ ls /var/log/collectl/ 
    worker-0-20221004-133156.raw.gz worker-0-collectl-202210.log
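
The check in step 5 has to run on the node itself. A minimal sketch of one way to do it from a workstation follows, reusing the oc debug invocation this article uses for log collection. The helper name check_collectl_logs and the 120-second timeout are assumptions for illustration and are not part of ocp-collectl.

```shell
# Sketch: wait for the collectl DaemonSet, then list its logs on every labeled node.
# check_collectl_logs is a hypothetical helper name; the timeout value is arbitrary.
check_collectl_logs() {
  # Block until the DaemonSet pods report ready, or give up after the timeout.
  oc rollout status daemonset/collectl -n collectl --timeout=120s || return 1

  # Read the node names into a variable first, then loop over them.
  nodes=$(oc get node -l collectl=true -o name) || return 1
  for node in ${nodes}; do
    echo "== ${node}"
    # ${node} already has the form node/<name>, which oc debug accepts directly.
    oc debug "${node}" -q --to-namespace=openshift-etcd -- \
      chroot /host ls /var/log/collectl/
  done
}
```

Calling check_collectl_logs prints one "== node/<name>" header per labeled node followed by that node's file listing. An empty listing shortly after installation may only mean that collectl has not flushed its first file yet.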
    

Collect and extract collectl raw log files

  1. Create the collectl_out directory and copy the compressed collectl raw log files from each labeled node into it:

    mkdir -p collectl_out
    oc get node -l collectl=true -o json | jq -r '.items[].metadata.name' | while read -r NODE; do
      oc debug "node/${NODE}" -q --to-namespace=openshift-etcd -- chroot /host sh -c 'cd /var/log/collectl; ls *.raw.gz' | while read -r FILE; do
        oc debug "node/${NODE}" -q --to-namespace=openshift-etcd -- chroot /host sh -c "cd /var/log/collectl; cat ${FILE}" > "collectl_out/${FILE}"
      done
    done
    
  2. Extract the .raw log files:

    for GZ in collectl_out/*.raw.gz; do zcat < "${GZ}" > "${GZ%.gz}"; done
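
Because the archives are streamed through oc debug, a dropped connection can leave a truncated file behind. The sketch below is one possible way to extract only the archives that pass a gzip integrity test. The helper name extract_raw is an assumption for illustration; gzip -t and gzip -dc are standard gzip options.

```shell
# Sketch: extract every collected archive, skipping any that fail a gzip integrity test.
# extract_raw is a hypothetical helper; its optional argument is the output directory.
extract_raw() {
  dir=${1:-collectl_out}
  failed=0
  for gz in "${dir}"/*.raw.gz; do
    [ -e "${gz}" ] || continue            # the glob matched nothing
    if gzip -t "${gz}" 2>/dev/null; then
      gzip -dc "${gz}" > "${gz%.gz}"      # keep the .gz and write the .raw next to it
    else
      echo "corrupt or truncated: ${gz}" >&2
      failed=1
    fi
  done
  return "${failed}"
}
```

Running extract_raw with no argument processes collectl_out. A non-zero exit status means at least one archive was skipped and should be collected again.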
    

Analyze the data

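Start a container from the collectl image with collectl_out mounted at /var/log/collectl, then analyze the raw files from its shell:

    podman run --platform=linux/amd64 --rm -ti -v "${PWD}/collectl_out:/var/log/collectl" quay.io/gmeghnag/collectl:4.3.20-ubi9 sh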
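
Inside the container shell, the raw files can be replayed with collectl's playback mode. The following is a minimal sketch, assuming the collectl binary is on the PATH in that image. The helper name replay_collectl and the default subsystem selection are illustrative choices, not something the article prescribes.

```shell
# Sketch: replay extracted raw files with collectl's playback mode.
# replay_collectl is a hypothetical helper; run it inside the container shell.
replay_collectl() {
  subsys=${1:-cdmn}                 # c=CPU, d=disk, m=memory, n=network
  dir=${2:-/var/log/collectl}       # mount point used by the podman command
  for raw in "${dir}"/*.raw; do
    [ -e "${raw}" ] || continue     # the glob matched nothing
    echo "== ${raw}"
    # -p: play back a recorded file, -s: subsystems to report,
    # -oT: prefix each sample with its timestamp
    collectl -p "${raw}" -s"${subsys}" -oT
  done
}
```

For example, replay_collectl cm limits the report to CPU and memory, which keeps the output narrow enough to compare nodes side by side.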

Cleanup

Once all diagnostics are complete:

  1. Delete the Namespace/collectl, ClusterRoleBinding/collectl-privileged, and DaemonSet/collectl resources:

    oc delete -k https://github.com/gmeghnag/ocp-collectl.git
    
  2. Remove the collectl logs from the nodes:

    oc get node -o name -l collectl=true | xargs -I {} oc debug {} -q --to-namespace=openshift-etcd -- chroot /host sh -c 'rm -rf /var/log/collectl'
    
  3. Remove the collectl=true label from the nodes:

    oc get node -o name -l collectl=true | xargs -I {} oc label {} collectl-
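
After the three cleanup steps, it can be useful to confirm that nothing was left behind. The sketch below checks the label, the namespace, and the cluster role binding named in step 1. The helper name verify_cleanup is an assumption for illustration.

```shell
# Sketch: confirm that cleanup left nothing behind.
# verify_cleanup is a hypothetical helper name.
verify_cleanup() {
  status=0

  # No node should still carry the collectl=true label.
  labeled=$(oc get node -l collectl=true -o name) || {
    echo "cannot query nodes" >&2
    return 1
  }
  if [ -n "${labeled}" ]; then
    echo "still labeled: ${labeled}" >&2
    status=1
  fi

  # oc get exits non-zero when the named resource is not found.
  if oc get namespace collectl >/dev/null 2>&1; then
    echo "namespace collectl still exists" >&2
    status=1
  fi

  # The binding is cluster-scoped, so check it separately from the namespace.
  if oc get clusterrolebinding collectl-privileged >/dev/null 2>&1; then
    echo "clusterrolebinding collectl-privileged still exists" >&2
    status=1
  fi

  return "${status}"
}
```

A namespace can remain in the Terminating state for a short time after oc delete returns, so a failure on the namespace check right after cleanup may clear on a second run.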
    

Configuration Changes

To modify the existing collectl deployment with custom configuration parameters, see this solution: https://access.redhat.com/solutions/7095759

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
