How to use the collectl utility to troubleshoot performance issues in Red Hat Enterprise Linux


Collectl is neither shipped nor supported by Red Hat, but it is sometimes used by customers and third-party vendors.

Note: Although a Red Hat engineer is now a maintainer of the upstream collectl project on GitHub, Red Hat still does not ship collectl in RHEL, nor does it provide support for it.

The following information has been provided by Red Hat, but is outside the scope of the posted Service Level Agreements and support procedures. Installing collectl does not render a system unsupportable by Red Hat Global Support Services; however, Red Hat Global Support Services will be unable to support or debug problems with collectl or resulting from installing and using collectl as it is not shipped in standard Red Hat Enterprise Linux channels. Installing third-party packages is done with the user's understanding of Red Hat's limitations in supporting issues with or resulting from the third-party packages.

How to obtain Collectl

The collectl community project is maintained at https://github.com/sharkcz/collectl and is also packaged in the Fedora community project.
For RHEL 6 and RHEL 7, the easiest way to install collectl is via the EPEL repositories (Extra Packages for Enterprise Linux) maintained by the Fedora community.

Note: The main community project was previously located at http://collectl.sourceforge.net/ -- that site is still present, but it lists the latest version as '4.3.1 Oct 31, 2018' at the top of the page, whereas https://sourceforge.net/projects/collectl/files/collectl/ lists the current version as of June 2023 as '4.3.8 Feb 07, 2023'. Per a comment on sourceforge.net, the main place for updates has moved to the GitHub repository above.

Follow these instructions to set up the EPEL repositories. Once set up, collectl can be installed with the following command:

# yum install collectl     
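If the EPEL repositories are not yet set up, one common way to enable them is to install the epel-release package directly from Fedora's download site. The URL below is the standard location for the RHEL 7 package; adjust the release number to match your RHEL version.

```shell
# Enable EPEL on RHEL 7 (example; use the release matching your system),
# then install collectl from it
yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install collectl
```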

The packages are also available for direct download using the following links:

  • For RHEL 8, collectl currently needs to be downloaded directly from SourceForge until it is added to RHEL 8 EPEL:

wget https://sourceforge.net/projects/collectl/files/collectl/collectl-4.3.2/collectl-4.3.2.src.tar.gz/download -O /tmp/collectl-4.3.2.src.tar.gz
tar -xvzf /tmp/collectl-4.3.2.src.tar.gz
cd collectl
./INSTALL
cd ../
systemctl start collectl        # start data collection service on host
systemctl enable collectl       # optional: enable collectl server to be started at boot time
ls -ltr /var/log/collectl/*     # where output from collectl is kept

Note!
collectl is now available in a git repository; a simple git clone of the URL below will get you what you need.

While the repository is publicly accessible, code changes and commits are maintained and updated only by Red Hat engineers.

https://github.com/sharkcz/collectl.git

After cloning

# cd collectl
# ./INSTALL

For RHEL 7 and later:

# systemctl start collectl
# systemctl enable collectl

General usage of collectl

Enable Collectl

The collectl utility can be run manually via the command line or as a service. Data will be logged to /var/log/collectl/*.raw.gz. The logs will be rotated every 24 hours by default. To run as a service:

# chkconfig collectl on       ## optional: enable the service so it starts at boot time
# service collectl start

Sample Intervals

When run manually from the command line, the default sample interval is 1 second.
When running as a service, the default sample intervals are as shown below. It is sometimes desirable to lower these (for example, to 1, 30, and 60 seconds) to avoid averaging over long periods.

# grep -i interval /etc/collectl.conf 
#Interval =     10
#Interval2 =    60
#Interval3 =   120
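Lowering the intervals means uncommenting and editing these lines. A minimal sketch using sed, run here against a scratch copy of the file; on a real system, point sed at /etc/collectl.conf instead. The 1/30/60 values are just the example mentioned above.

```shell
# Work on a scratch copy; substitute /etc/collectl.conf on a real system
conf=$(mktemp)
printf '#Interval =     10\n#Interval2 =    60\n#Interval3 =   120\n' > "$conf"

# Uncomment each Interval line and set the lower values
sed -i -e 's/^#Interval =.*/Interval = 1/' \
       -e 's/^#Interval2 =.*/Interval2 = 30/' \
       -e 's/^#Interval3 =.*/Interval3 = 60/' "$conf"
cat "$conf"
```

The collectl service must be restarted for interval changes to take effect.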

Log file

When run automatically from the daemon, the output log file location is specified in /etc/collectl.conf. This can be changed to a new location if desired: edit /etc/collectl.conf and change the path after the -f option in the DaemonCommands line to the new location.

grep -i daemoncommands /etc/collectl.conf | grep -v "^#"
DaemonCommands = -f /var/log/collectl -r00:00,7 -m -F60 -s+YZ -i1
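A minimal sketch of that edit using sed, run here against a scratch copy of the file; on a real system, point sed at /etc/collectl.conf. The /data/collectl path is just an example destination.

```shell
# Work on a scratch copy; substitute /etc/collectl.conf on a real system
conf=$(mktemp)
echo 'DaemonCommands = -f /var/log/collectl -r00:00,7 -m -F60 -s+YZ -i1' > "$conf"

# Point the -f option at the new log directory (/data/collectl is an example)
sed -i 's|-f /var/log/collectl|-f /data/collectl|' "$conf"
cat "$conf"
```

Restart the collectl service afterward, and make sure the new directory exists and is writable.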

Using collectl to troubleshoot disk or SAN storage performance

The defaults (10-second intervals for everything except process data, which is collected at 60-second intervals) are best left as-is, even for storage performance analysis.

The SAR Equivalence Matrix shows common SAR command equivalents to help experienced SAR users learn to use Collectl.

The following example command will view summary detail of the CPU, Network and Disk from the file /var/log/collectl/HOSTNAME-20130416-164506.raw.gz:

collectl -scnd -oT -p HOSTNAME-20130416-164506.raw.gz
#         <----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#Time     cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut 
16:46:10    9   2 14470  20749      0      0     69      9      0      1      0       2 
16:46:20   13   4 14820  22569      0      0    312     25    253    174      7      79 
16:46:30   10   3 15175  21546      0      0     54      5      0      2      0       3 
16:46:40    9   2 14741  21410      0      0     57      9      1      2      0       4 
16:46:50   10   2 14782  23766      0      0    374      8    250    171      5      75 
....
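Because the playback output is plain whitespace-separated text, it can be post-processed with standard tools such as awk. A small sketch that finds the interval with the highest KBWrit (field 8 of the -scnd summary); sample rows are embedded here in place of piping in real collectl output.

```shell
# Sample rows standing in for `collectl -scnd -oT -p FILE.raw.gz` output
sample='16:46:10    9   2 14470  20749      0      0     69      9      0      1      0       2
16:46:20   13   4 14820  22569      0      0    312     25    253    174      7      79
16:46:50   10   2 14782  23766      0      0    374      8    250    171      5      75'

# KBWrit is field 8; report the timestamp with the largest value
peak=$(echo "$sample" | awk '$8 > max { max = $8; t = $1 } END { print t, max }')
echo "$peak"    # -> 16:46:50 374
```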

The next example outputs the one-minute period from 17:00 to 17:01.

collectl -scnd -oT --from 17:00 --thru 17:01 -p HOSTNAME-20130416-164506.raw.gz
#         <----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#Time     cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut 
17:00:00   13   3 15870  25320      0      0     67      9    251    172      6      90 
17:00:10   16   4 16386  24539      0      0    315     17    246    170      6      84 
17:00:20   10   2 14959  22465      0      0     65     26      5      6      1       8 
17:00:30   11   3 15056  24852      0      0    323     12    250    170      5      69 
17:00:40   18   5 16595  23826      0      0    463     13      1      5      0       5 
17:00:50   12   3 15457  23663      0      0     57      9    250    170      6      76 
17:01:00   13   4 15479  24488      0      0    304      7    254    176      5      70 

The next example will output Detailed Disk data.

collectl -scnD -oT -p HOSTNAME-20130416-164506.raw.gz

### RECORD    7 >>> tabserver <<< (1366318860.001) (Thu Apr 18 17:01:00 2013) ###

# CPU[HYPER] SUMMARY (INTR, CTXSW & PROC /sec)
# User  Nice   Sys  Wait   IRQ  Soft Steal  Idle  CPUs  Intr  Ctxsw  Proc  RunQ   Run   Avg1  Avg5 Avg15 RunT BlkT
     8     0     3     0     0     0     0    86     8   15K    24K     0   638     5   1.07  1.05  0.99    0    0

# DISK STATISTICS (/sec)
#          <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
sda              0      0    0    0     304     11    7   44      44     2    16      6    4
sdb              0      0    0    0       0      0    0    0       0     0     0      0    0
dm-0             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-1             0      0    0    0       5      0    1    4       4     1     2      2    0
dm-2             0      0    0    0     298      0   14   22      22     1     4      3    4
dm-3             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-4             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-5             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-6             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-7             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-8             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-9             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-10            0      0    0    0       0      0    0    0       0     0     0      0    0
dm-11            0      0    0    0       0      0    0    0       0     0     0      0    0

# NETWORK SUMMARY (/sec)
# KBIn  PktIn SizeIn  MultI   CmpI  ErrsI  KBOut PktOut  SizeO   CmpO  ErrsO
   253    175   1481      0      0      0      5     70     79      0      0
....
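The detailed disk table can be post-processed the same way, for instance to find the busiest device by the Util column (field 14). A sketch with sample data rows embedded in place of the real DISK STATISTICS output above:

```shell
# Sample data rows standing in for the DISK STATISTICS section
disks='sda              0      0    0    0     304     11    7   44      44     2    16      6    4
dm-1             0      0    0    0       5      0    1    4       4     1     2      2    0
dm-2             0      0    0    0     298      0   14   22      22     1     4      3    4'

# Util is field 14; report the device with the highest utilization
busiest=$(echo "$disks" | awk '$14 > max { max = $14; dev = $1 } END { print dev, max }')
echo "$busiest"    # -> sda 4
```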

Commonly used options

These generate summary data, which is the total of all data for a particular type:
- b - buddy info (memory fragmentation)
- c - cpu
- d - disk
- f - nfs
- i - inodes
- j - interrupts by CPU
- l - lustre
- m - memory
- n - network
- s - sockets
- t - tcp
- x - interconnect
- y - slabs (system object caches)

These generate detail data, typically but not limited to the device level:
- C - individual CPUs, including interrupts if -sj or -sJ
- D - individual disks
- E - environmental (fan, power, temp) [requires ipmitool]
- F - nfs data
- J - interrupts by CPU by interrupt number
- L - lustre
- M - memory numa/node
- N - individual networks
- T - tcp details (lots of data!)
- X - interconnect ports/rails (InfiniBand/Quadrics)
- Y - slabs/slubs
- Z - processes

The most useful switches are listed here
- -sD detailed disk data
- -sC detailed CPU data
- -sN detailed network data
- -sZ detailed process data
