How to use the collectl utility to troubleshoot performance issues in Red Hat Enterprise Linux
Collectl is neither shipped nor supported by Red Hat, but it is sometimes used by users and third-party vendors.
Note! While a Red Hat engineer is now a maintainer of the upstream collectl project on GitHub, Red Hat still does not ship collectl on RHEL, nor does it provide support for it.
The following information has been provided by Red Hat, but is outside the scope of the posted Service Level Agreements and support procedures. Installing collectl does not render a system unsupportable by Red Hat Global Support Services; however, Red Hat Global Support Services will be unable to support or debug problems with collectl or resulting from installing and using collectl as it is not shipped in standard Red Hat Enterprise Linux channels. Installing third-party packages is done with the user's understanding of Red Hat's limitations in supporting issues with or resulting from the third-party packages.
How to obtain Collectl
The collectl community project is maintained at https://github.com/sharkcz/collectl as well as provided in the Fedora community project. For Red Hat Enterprise Linux 6 and 7, the easiest way to install collectl is via the EPEL repositories (Extra Packages for Enterprise Linux) maintained by the Fedora community.
Note: The main community project was previously located at http://collectl.sourceforge.net/ -- while that page is still present, it lists the latest version as '4.3.1 Oct 31, 2018' at the top, whereas https://sourceforge.net/projects/collectl/files/collectl/ lists the current version available as of June 2023 as '4.3.8 Feb 07, 2023'. Per a comment on sourceforge.net, the main place for updates has moved to the GitHub repository above.
Follow these instructions to set up the EPEL repositories. Once set up, collectl can be installed with the following command:
# yum install collectl
The packages are also available for direct download using the following links:
- RHEL 8: for now, collectl needs to be downloaded directly from SourceForge until it is added to the RHEL 8 EPEL repository:
# wget https://sourceforge.net/projects/collectl/files/collectl/collectl-4.3.2/collectl-4.3.2.src.tar.gz/download -O /tmp/collectl-4.3.2.src.tar.gz
# tar -xzvf /tmp/collectl-4.3.2.src.tar.gz
# cd collectl
# ./INSTALL
# cd ../
# systemctl start collectl      # start data collection service on host
# systemctl enable collectl     # optional: enable the collectl service to start at boot time
# ls -ltr /var/log/collectl/*   # where output from collectl is kept
- RHEL 7 x86_64 https://archives.fedoraproject.org/pub/archive/epel/7/x86_64/
- RHEL 6 x86_64 https://archives.fedoraproject.org/pub/archive/epel/6/x86_64/
- RHEL 5 x86_64 (available in the EPEL archives) https://archive.fedoraproject.org/pub/archive/epel/5/x86_64/
Note!
collectl is now available in a git repository. A simple git clone of the following will get you what you need:
https://github.com/sharkcz/collectl.git
After cloning
# cd collectl
# ./INSTALL
For RHEL 7 and later:
# systemctl start collectl
# systemctl enable collectl
General usage of collectl
Enable Collectl
The collectl utility can be run manually via the command line or as a service. Data will be logged to /var/log/collectl/*.raw.gz. The logs will be rotated every 24 hours by default. To run as a service:
# chkconfig collectl on ## Optional step, enabled in runlevel 3, to start at boot time
# service collectl start
Sample Intervals
When run manually from the command line, the default sample interval is 1 second.
When running as a service, the default sample intervals are as shown below. It is sometimes desirable to lower these to avoid averaging, for example to 1,30,60.
# grep -i interval /etc/collectl.conf
#Interval = 10
#Interval2 = 60
#Interval3 = 120
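To lower them, uncomment and edit the Interval lines and restart the collectl service. A minimal sketch of the edit, done here against a scratch copy in /tmp so the result can be previewed before touching the real file (the sed patterns assume the stock commented-out lines shown above):

```shell
# Work on a scratch copy first; fall back to a stub if collectl.conf is absent.
cp /etc/collectl.conf /tmp/collectl.conf.new 2>/dev/null || \
  printf '#Interval =     10\n#Interval2 =    60\n#Interval3 =   120\n' > /tmp/collectl.conf.new

# Uncomment the Interval lines and lower them to 1/30/60 seconds.
sed -i -e 's/^#*Interval *=.*/Interval = 1/' \
       -e 's/^#*Interval2 *=.*/Interval2 = 30/' \
       -e 's/^#*Interval3 *=.*/Interval3 = 60/' /tmp/collectl.conf.new

grep '^Interval' /tmp/collectl.conf.new
```

Once the copy looks right, move it over /etc/collectl.conf and run service collectl restart so the daemon picks up the new intervals.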
Log file
When run automatically from the daemon, the output log file location is specified by the DaemonCommands setting within /etc/collectl.conf. This can be changed to a new location if desired: edit /etc/collectl.conf and change the file path after the -f option to the new location.
# grep -i daemoncommands /etc/collectl.conf | grep -v "^#"
DaemonCommands = -f /var/log/collectl -r00:00,7 -m -F60 -s+YZ -i1
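For example, to send the logs to a different filesystem, only the -f argument changes. In this sketch, /data/collectl is a hypothetical destination, with the rest of the stock options left intact:

```
DaemonCommands = -f /data/collectl -r00:00,7 -m -F60 -s+YZ -i1
```

Create the directory first (mkdir -p /data/collectl) and restart the collectl service so the daemon picks up the change.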
Using collectl to troubleshoot disk or SAN storage performance
The defaults (10s for everything except process data, which is collected at 60s intervals) are best left as-is, even for storage performance analysis.
The SAR Equivalence Matrix shows common SAR command equivalents to help experienced SAR users learn to use Collectl.
The following example command will view summary detail of the CPU, Network and Disk from the file /var/log/collectl/HOSTNAME-20130416-164506.raw.gz:
collectl -scnd -oT -p HOSTNAME-20130416-164506.raw.gz
# <----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#Time cpu sys inter ctxsw KBRead Reads KBWrit Writes KBIn PktIn KBOut PktOut
16:46:10 9 2 14470 20749 0 0 69 9 0 1 0 2
16:46:20 13 4 14820 22569 0 0 312 25 253 174 7 79
16:46:30 10 3 15175 21546 0 0 54 5 0 2 0 3
16:46:40 9 2 14741 21410 0 0 57 9 1 2 0 4
16:46:50 10 2 14782 23766 0 0 374 8 250 171 5 75
....
The next example will output the 1 minute period from 17:00 - 17:01.
collectl -scnd -oT --from 17:00 --thru 17:01 -p HOSTNAME-20130416-164506.raw.gz
# <----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#Time cpu sys inter ctxsw KBRead Reads KBWrit Writes KBIn PktIn KBOut PktOut
17:00:00 13 3 15870 25320 0 0 67 9 251 172 6 90
17:00:10 16 4 16386 24539 0 0 315 17 246 170 6 84
17:00:20 10 2 14959 22465 0 0 65 26 5 6 1 8
17:00:30 11 3 15056 24852 0 0 323 12 250 170 5 69
17:00:40 18 5 16595 23826 0 0 463 13 1 5 0 5
17:00:50 12 3 15457 23663 0 0 57 9 250 170 6 76
17:01:00 13 4 15479 24488 0 0 304 7 254 176 5 70
The next example will output Detailed Disk data.
collectl -scnD -oT -p HOSTNAME-20130416-164506.raw.gz
### RECORD 7 >>> tabserver <<< (1366318860.001) (Thu Apr 18 17:01:00 2013) ###
# CPU[HYPER] SUMMARY (INTR, CTXSW & PROC /sec)
# User Nice Sys Wait IRQ Soft Steal Idle CPUs Intr Ctxsw Proc RunQ Run Avg1 Avg5 Avg15 RunT BlkT
8 0 3 0 0 0 0 86 8 15K 24K 0 638 5 1.07 1.05 0.99 0 0
# DISK STATISTICS (/sec)
# <---------reads---------><---------writes---------><--------averages--------> Pct
#Name KBytes Merged IOs Size KBytes Merged IOs Size RWSize QLen Wait SvcTim Util
sda 0 0 0 0 304 11 7 44 44 2 16 6 4
sdb 0 0 0 0 0 0 0 0 0 0 0 0 0
dm-0 0 0 0 0 0 0 0 0 0 0 0 0 0
dm-1 0 0 0 0 5 0 1 4 4 1 2 2 0
dm-2 0 0 0 0 298 0 14 22 22 1 4 3 4
dm-3 0 0 0 0 0 0 0 0 0 0 0 0 0
dm-4 0 0 0 0 0 0 0 0 0 0 0 0 0
dm-5 0 0 0 0 0 0 0 0 0 0 0 0 0
dm-6 0 0 0 0 0 0 0 0 0 0 0 0 0
dm-7 0 0 0 0 0 0 0 0 0 0 0 0 0
dm-8 0 0 0 0 0 0 0 0 0 0 0 0 0
dm-9 0 0 0 0 0 0 0 0 0 0 0 0 0
dm-10 0 0 0 0 0 0 0 0 0 0 0 0 0
dm-11 0 0 0 0 0 0 0 0 0 0 0 0 0
# NETWORK SUMMARY (/sec)
# KBIn PktIn SizeIn MultI CmpI ErrsI KBOut PktOut SizeO CmpO ErrsO
253 175 1481 0 0 0 5 70 79 0 0
....
Commonly used options
These options generate summary data, which is the total of ALL data for a particular type:
- b
- buddy info (memory fragmentation)
- c
- cpu
- d
- disk
- f
- nfs
- i
- inodes
- j
- interrupts by CPU
- l
- lustre
- m
- memory
- n
- network
- s
- sockets
- t
- tcp
- x
- Interconnect
- y
- Slabs (system object caches)
These generate detail data, typically (but not limited to) at the device level:
- C
- individual CPUs, including interrupts if -sj or -sJ is also specified
- D
- individual Disks
- E
- environmental (fan, power, temp) [requires ipmitool]
- F
- nfs data
- J
- interrupts by CPU by interrupt number
- L
- lustre
- M
- memory numa/node
- N
- individual Networks
- T
- tcp details (lots of data!)
- X
- interconnect ports/rails (Infiniband/Quadrics)
- Y
- slabs/slubs
- Z
- processes
The most useful switches are listed here
- -sD
detailed disk data
- -sC
detailed CPU data
- -sN
detailed network data
- -sZ
detailed process data
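One playback pitfall worth noting: piping collectl output through grep to isolate a single process also filters out the '#' column-header lines. Including '^#' in the pattern keeps them. A small sketch of the pattern -- the printf merely stands in for real collectl -sZ output here:

```shell
# Keep the '#' header lines while filtering for a process name.
# With a real file: collectl -sZ -p /var/log/collectl/HOSTNAME-date.raw.gz | egrep '^#|tomcat'
printf '#PID  User    Command\n1564  tomcat  java\n4436  root    perl\n' | egrep '^#|tomcat'
```

This prints the header line plus the matching tomcat row, while the non-matching rows are dropped.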
9 Comments
nice info.
Sadly the bug 472850 is internal. So the reasons why collectl is not in RHEL are still unknown
@Peter, I have updated this article to clarify that the decision was made instead to standardize on Performance Co-Pilot (PCP). This document has been updated to reflect that and provide guidance on how to use PCP instead.
thank you, really helpful.
Some more examples:
any process with abc in the name
collectl -i:1 -sZ --procfilt fabc
data from interface with eth0 in the name
collectl -sN --netfilt eth0
How do I understand what each column represents, and is there a better way to analyze which process was pegging the CPU and why?
collectl -sCZ /var/log/collectl/app031a-20220407-000000.raw.gz |grep tomcat
1564  tomcat 20 1     45  S 2G   191M 1 0.04 0.03 0 01:20:06 0 0 0 0  /usr/lib/jvm/jre/bin/java
40898 tomcat 20 40896 0   S 114M 2M   0 0.00 0.00 0 00:00.03 0 0 0 0  -ksh
40972 tomcat 20 40898 0   S 119M 2M   1 0.00 0.00 0 00:00.04 0 0 0 0  bash
44362 tomcat 20 40972 0   R 172M 22M  0 0.04 0.10 0 00:00.40 0 0 0 1  /usr/bin/perl
44363 tomcat 20 40972 0   S 110M 984K 0 0.00 0.00 0 00:00.00 0 0 0 0  grep
57805 tomcat 20 1     168 S 9G   3G   1 0.62 2.34 4 02:11:59 0 5 0 39 /usr/lib/jvm/jre/bin/java
1564  tomcat 20 1     45  S 2G   191M 1 0.04 0.03 0 01:20:06 0 0 0 0  /usr/lib/jvm/jre/bin/java
The "grep tomcat" portion of your command is filtering out the column headers.
If you change your options to collectl, it may change your column headers. So run it without the
grep
command to see what you need to add to your filter to give you what you need.
Thanks. How do I run the command without grep against the logfile "/var/log/collectl/app031a-20220407-000000.raw.gz"? So how do I analyze it to understand which process was doing what?
Just a heads up that in the first wget command step of the instructions to install from Sourceforge for RHEL 8, you have it saving the file as "/tmp/collect-4.3.2.src.tar.gz" (notice it is 'collect' instead of 'collectl'), but then the tar command in the next step references "/tmp/collectl-4.3.2.src.tar.gz" (the 'l' in 'collectl' is included again). The rest of the commands are correct then, since the directory that is untarred is correctly named 'collectl'.
Easy to realize the issue and fix on-the-fly, but just thought I'd point it out.
Thanks!