Chapter 25. Performance Counters for Linux (PCL) Tools and perf

Performance Counters for Linux (PCL) is a kernel-based subsystem that provides a framework for collecting and analyzing performance data. Red Hat Enterprise Linux 7 includes this kernel subsystem to collect data and the user-space tool perf to analyze the collected data. The PCL subsystem can measure hardware events, such as retired instructions and processor clock cycles, and software events, such as major page faults and context switches. For example, PCL counters can compute Instructions Per Clock (IPC) by dividing a process’s count of retired instructions by its count of processor clock cycles. A low IPC ratio indicates that the code makes poor use of the CPU. Other hardware events can also be used to diagnose poor CPU performance.
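As a minimal sketch of the IPC calculation described above, the following shell function divides an instruction count by a cycle count. The function name ipc and the use of awk for the arithmetic are illustrative assumptions and are not part of perf. The two counts are taken from the sample perf stat output in Section 25.2.

```shell
#!/bin/sh
# ipc INSTRUCTIONS CYCLES
# Print instructions per clock cycle to three decimal places.
# Hypothetical helper for illustration; perf stat reports this ratio itself.
ipc() {
    awk -v ins="$1" -v cyc="$2" 'BEGIN {
        if (cyc + 0 == 0) { print "n/a"; exit }   # avoid division by zero
        printf "%.3f\n", ins / cyc
    }'
}

# Counts of retired instructions and processor clock cycles.
ipc 1050912611378 789702529782   # prints 1.331
```

The result matches the IPC value that perf stat prints next to the instructions counter for the same run.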

Performance counters can also be configured to record samples. The relative number of samples attributed to each region of code can be used to identify which regions have the greatest impact on performance.

25.1. Perf Tool Commands

Useful perf commands include the following:

perf stat
This perf command provides overall statistics for common performance events, including instructions executed and clock cycles consumed. Options allow selection of events other than the default measurement events.
perf record
This perf command records performance data into a file that can later be analyzed using perf report.
perf report
This perf command reads the performance data from a file and analyzes the recorded data.
perf list
This perf command lists the events available on a particular machine. These events vary based on the performance monitoring hardware and the software configuration of the system.

Use perf help to obtain a complete list of perf commands. To retrieve the man page for an individual perf command, use perf help command, where command is the name of that perf command (for example, perf help stat).

25.2. Using Perf

Using the basic PCL infrastructure for collecting statistics or samples of program execution is relatively straightforward. This section provides simple examples of overall statistics and sampling.

To collect statistics on make and its children, use the following command:

# perf stat -- make all

The perf command reads a number of hardware and software counters while make and its children run. When make exits, perf prints output similar to the following:

Performance counter stats for 'make all':

  244011.782059  task-clock-msecs         #      0.925 CPUs
          53328  context-switches         #      0.000 M/sec
            515  CPU-migrations           #      0.000 M/sec
        1843121  page-faults              #      0.008 M/sec
   789702529782  cycles                   #   3236.330 M/sec
  1050912611378  instructions             #      1.331 IPC
   275538938708  branches                 #   1129.203 M/sec
     2888756216  branch-misses            #      1.048 %
     4343060367  cache-references         #     17.799 M/sec
      428257037  cache-misses             #      1.755 M/sec

  263.779192511  seconds time elapsed
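The figures after each "#" in this output are ratios derived from the raw counts. The following sketch recomputes three of them to show where they come from. The function name derive_ratios is hypothetical, and the parsing assumes the "count, event name" layout shown above, which may differ between perf versions.

```shell
#!/bin/sh
# derive_ratios: read perf stat output on stdin and recompute the ratios
# that perf prints after the "#" on each line.
# Hypothetical helper; it assumes the "count  event-name" layout shown
# above, which may differ in other perf versions.
derive_ratios() {
    awk '
        $2 == "task-clock-msecs" { task_ms = $1 }
        $2 == "cycles"           { cycles = $1 }
        $2 == "instructions"     { instructions = $1 }
        $2 == "branches"         { branches = $1 }
        $2 == "branch-misses"    { branch_misses = $1 }
        $2 == "seconds"          { elapsed = $1 }
        END {
            # CPU time (converted from msec to sec) divided by wall-clock time
            if (elapsed > 0)  printf "CPUs utilized     %.3f\n", task_ms / 1000 / elapsed
            if (cycles > 0)   printf "IPC               %.3f\n", instructions / cycles
            if (branches > 0) printf "branch miss rate  %.3f%%\n", 100 * branch_misses / branches
        }'
}

# Selected lines copied from the sample output.
derive_ratios <<'EOF'
  244011.782059  task-clock-msecs         #      0.925 CPUs
   789702529782  cycles                   #   3236.330 M/sec
  1050912611378  instructions             #      1.331 IPC
   275538938708  branches                 #   1129.203 M/sec
     2888756216  branch-misses            #      1.048 %
  263.779192511  seconds time elapsed
EOF
```

For the sample data, this reproduces the 0.925 CPUs, 1.331 IPC, and 1.048 % values that perf printed.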

The perf tool can also record samples. For example, to record data on the make command and its children, use:

# perf record -- make all

When the command finishes, perf prints the name of the file in which the samples are stored, along with the number of samples collected:

[ perf record: Woken up 42 times to write data ]
[ perf record: Captured and wrote 9.753 MB perf.data (~426109 samples) ]

Performance Counters for Linux (PCL) Tools conflict with OProfile

Both OProfile and Performance Counters for Linux (PCL) use the same hardware Performance Monitoring Unit (PMU). If OProfile is running when you attempt to use the perf command, perf fails with an error message like the following:

Error: open_counter returned with 16 (Device or resource busy). /usr/bin/dmesg may provide additional information.

Fatal: Not all events could be opened.

To use the perf command, first shut down OProfile:

# opcontrol --deinit

After perf record has written perf.data, you can analyze the file to determine the relative frequency of samples. Use perf report to output an analysis of perf.data; the report includes the command, object, and function for the samples. For example, the following command sorts the report by command, showing which executables consume the most time:

# perf report --sort=comm

The resulting output:

# Samples: 1083783860000
#
# Overhead          Command
# ........  ...............
#
    48.19%         xsltproc
    44.48%        pdfxmltex
     6.01%             make
     0.95%             perl
     0.17%       kernel-doc
     0.05%          xmllint
     0.05%              cc1
     0.03%               cp
     0.01%            xmlto
     0.01%               sh
     0.01%          docproc
     0.01%               ld
     0.01%              gcc
     0.00%               rm
     0.00%              sed
     0.00%   git-diff-files
     0.00%             bash
     0.00%   git-diff-index
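To see how much of the run the largest contributors account for, the overhead percentages can be summed. The sketch below is a hypothetical helper named top_share. It assumes the rows arrive already sorted by overhead, as perf report prints them, and that comment lines start with "#".

```shell
#!/bin/sh
# top_share N: read "perf report --sort=comm" output on stdin and print the
# combined overhead of the first N commands.
# Hypothetical helper; assumes rows of the form "  48.19%   xsltproc".
top_share() {
    awk -v n="$1" '
        /^#/ || NF < 2 { next }          # skip header comments and blank lines
        seen < n       { sub(/%/, "", $1); total += $1; seen++ }
        END            { printf "%.2f%%\n", total }'
}

# First rows of the report above.
top_share 2 <<'EOF'
# Overhead          Command
# ........  ...............
#
    48.19%         xsltproc
    44.48%        pdfxmltex
     6.01%             make
     0.95%             perl
EOF
```

For this report, the first two commands together account for 92.67% of the samples.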

The column on the left shows the percentage of samples attributed to each command. This output shows that make spends most of its time in xsltproc and pdfxmltex, so to reduce the time for make to complete, focus on those two commands. To list the functions executed by xsltproc, run:

# perf report -n --comm=xsltproc

This generates:

comm: xsltproc
# Samples: 472520675377
#
# Overhead  Samples                    Shared Object  Symbol
# ........ ..........  .............................  ......
#
    45.54% 215179861044  libxml2.so.2.7.6               [.] xmlXPathCmpNodesExt
    11.63%  54959620202  libxml2.so.2.7.6               [.] xmlXPathNodeSetAdd__internal_alias
     8.60%  40634845107  libxml2.so.2.7.6               [.] xmlXPathCompOpEval
     4.63%  21864091080  libxml2.so.2.7.6               [.] xmlXPathReleaseObject
     2.73%  12919672281  libxml2.so.2.7.6               [.] xmlXPathNodeSetSort__internal_alias
     2.60%  12271959697  libxml2.so.2.7.6               [.] valuePop
     2.41%  11379910918  libxml2.so.2.7.6               [.] xmlXPathIsNaN__internal_alias
     2.19%  10340901937  libxml2.so.2.7.6               [.] valuePush__internal_alias