5.3.2. Profiling Cache Usage with Cachegrind

Cachegrind simulates your program's interaction with a machine's cache hierarchy and, optionally, its branch predictor. It tracks usage of the simulated first-level instruction and data caches to detect code that interacts poorly with this level of cache. It also tracks usage of the last-level cache, whether that is a second- or third-level cache, in order to track accesses to main memory. Because of this simulation, programs run with Cachegrind run 20 to 100 times slower than when run normally.
To run Cachegrind, execute the following command, replacing program with the program you wish to profile:
# valgrind --tool=cachegrind program
Cachegrind can gather the following statistics for the entire program, and for each function in the program:
  • first-level instruction cache reads (or instructions executed) and read misses, and last-level cache instruction read misses;
  • data cache reads (or memory reads), read misses, and last-level cache data read misses;
  • data cache writes (or memory writes), write misses, and last-level cache write misses;
  • conditional branches executed and mispredicted; and
  • indirect branches executed and mispredicted.
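The counts above are usually easier to judge as miss rates. The following is a minimal sketch of that arithmetic, not Cachegrind's own code. It assumes the short event names that Cachegrind uses in its output (Ir, I1mr, ILmr, Dr, D1mr, DLmr, Dw, D1mw, DLmw) and assumes that the last-level rate is taken over all instruction and data accesses. The function name miss_rates and the sample numbers are hypothetical.

```python
def miss_rates(counts):
    """Return I1, D1, and LL miss rates as fractions between 0.0 and 1.0.

    `counts` maps event names to whole-program totals:
    Ir/I1mr/ILmr  instruction reads and their first- and last-level misses
    Dr/D1mr/DLmr  data reads and their first- and last-level misses
    Dw/D1mw/DLmw  data writes and their first- and last-level misses
    """
    def ratio(misses, accesses):
        # Avoid dividing by zero when an event was never recorded.
        return misses / accesses if accesses else 0.0

    data_accesses = counts["Dr"] + counts["Dw"]
    all_accesses = counts["Ir"] + data_accesses
    return {
        "I1": ratio(counts["I1mr"], counts["Ir"]),
        "D1": ratio(counts["D1mr"] + counts["D1mw"], data_accesses),
        # Assumption: last-level misses are measured against all accesses.
        "LL": ratio(counts["ILmr"] + counts["DLmr"] + counts["DLmw"],
                    all_accesses),
    }


# Illustrative numbers only, not measurements from a real run.
example = {
    "Ir": 1000, "I1mr": 10, "ILmr": 5,
    "Dr": 400, "D1mr": 40, "DLmr": 8,
    "Dw": 100, "D1mw": 10, "DLmw": 2,
}
for cache, rate in miss_rates(example).items():
    print(f"{cache} miss rate: {rate:.1%}")
```

A high first-level rate combined with a low last-level rate suggests that most misses are still served from cache rather than from main memory.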
Cachegrind prints summary information about these statistics to the console, and writes more detailed profiling information to a file (cachegrind.out.pid by default, where pid is the process ID of the program on which you ran Cachegrind). This file can be further processed by the accompanying cg_annotate tool, like so:
# cg_annotate cachegrind.out.pid

Note

cg_annotate can output lines longer than 120 characters, depending on the length of the path. To make the output clearer and easier to read, we recommend making your terminal window at least this wide before executing the aforementioned command.
You can also compare the profile files created by Cachegrind to make it simpler to chart program performance before and after a change. To do so, use the cg_diff command, replacing first with the initial profile output file, and second with the subsequent profile output file:
# cg_diff first second
This command writes a combined profile to standard output. Redirect the output to a file in order to view it in more detail with cg_annotate.
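For a quick before-and-after check of whole-program totals, the profile files can also be read directly, because they are plain text. The following is a rough sketch of the idea, not how cg_diff is implemented. It assumes that each profile contains an events: line naming the counters and a summary: line holding the whole-program totals in the same order. The function names read_totals and total_deltas and the sample values are hypothetical.

```python
def read_totals(profile_text):
    """Map each event name to its whole-program total."""
    events, totals = None, None
    for line in profile_text.splitlines():
        if line.startswith("events:"):
            events = line.split()[1:]
        elif line.startswith("summary:"):
            totals = [int(n) for n in line.split()[1:]]
    if events is None or totals is None:
        raise ValueError("no 'events:' or 'summary:' line found")
    return dict(zip(events, totals))


def total_deltas(before_text, after_text):
    """Return after minus before for every event in the first profile."""
    before = read_totals(before_text)
    after = read_totals(after_text)
    return {name: after.get(name, 0) - before[name] for name in before}


# Made-up profile fragments for illustration. With real files, pass the
# result of open(path).read() for each profile instead.
before = "events: Ir Dr D1mr\nsummary: 1000 400 40\n"
after = "events: Ir Dr D1mr\nsummary: 900 400 12\n"
print(total_deltas(before, after))  # {'Ir': -100, 'Dr': 0, 'D1mr': -28}
```

Negative values mean the second run recorded fewer events than the first. Unlike cg_diff, this sketch gives no per-function breakdown.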
Cachegrind supports a number of options to focus its output. Some of the options available are:
--I1
Specifies the size (in bytes), associativity, and line size (in bytes) of the first-level instruction cache, separated by commas: --I1=size,associativity,line size.
--D1
Specifies the size (in bytes), associativity, and line size (in bytes) of the first-level data cache, separated by commas: --D1=size,associativity,line size.
--LL
Specifies the size (in bytes), associativity, and line size (in bytes) of the last-level cache, separated by commas: --LL=size,associativity,line size.
--cache-sim
Enables or disables the collection of cache access and miss counts. The default value is yes (enabled).
Note that disabling both this and --branch-sim leaves Cachegrind with no information to collect.
--branch-sim
Enables or disables the collection of branch instruction and misprediction counts. This is set to no (disabled) by default, since it slows Cachegrind by approximately 25 percent.
Note that disabling both this and --cache-sim leaves Cachegrind with no information to collect.
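The options above can be combined in a single run. The following is a minimal sketch of assembling such a command line from a script, assuming that cache sizes and line sizes are given in bytes. The helper names cache_geometry and cachegrind_argv are hypothetical, the cache values are illustrative, and the sanity checks are this sketch's own rather than a copy of Valgrind's validation.

```python
def cache_geometry(spec):
    """Parse 'size,associativity,line size' and derive the number of sets."""
    size, associativity, line_size = (int(part) for part in spec.split(","))
    if min(size, associativity, line_size) <= 0:
        raise ValueError(f"all three values must be positive: {spec!r}")
    if size % (associativity * line_size) != 0:
        raise ValueError(f"size is not a whole number of sets: {spec!r}")
    return {
        "size": size,
        "associativity": associativity,
        "line_size": line_size,
        # Each set holds `associativity` lines of `line_size` bytes.
        "sets": size // (associativity * line_size),
    }


def cachegrind_argv(program_args, cache_sim=True, branch_sim=False,
                    i1=None, d1=None, ll=None):
    """Build the argument list for a Cachegrind run of `program_args`."""
    if not (cache_sim or branch_sim):
        # With both simulations off there is nothing left to collect.
        raise ValueError("enable cache_sim or branch_sim")
    argv = [
        "valgrind", "--tool=cachegrind",
        "--cache-sim=" + ("yes" if cache_sim else "no"),
        "--branch-sim=" + ("yes" if branch_sim else "no"),
    ]
    for flag, spec in (("--I1", i1), ("--D1", d1), ("--LL", ll)):
        if spec is not None:
            cache_geometry(spec)  # raises ValueError on a malformed spec
            argv.append(f"{flag}={spec}")
    return argv + list(program_args)


# Illustrative geometry, not a description of any particular processor.
print(cache_geometry("32768,8,64")["sets"])  # 64
print(cachegrind_argv(["./program"], branch_sim=True, d1="32768,8,64"))
```

The returned list can be passed to subprocess.run() to start the profiling run. Setting both simulation options explicitly keeps the command independent of the defaults.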
For a full list of options, refer to the documentation included at /usr/share/doc/valgrind-version/valgrind_manual.pdf.