Introduction to storage performance analysis with PCP

Updated -

This article is an introduction and tutorial on basic storage performance analysis tasks using Performance Co-Pilot (PCP).

INTRODUCTION

Performance Co-Pilot is a system level performance monitoring and management suite, particularly suited for use in enterprise environments. PCP has shipped and been fully supported by Red Hat since RHEL 6.6 and in many other distributions (e.g. Fedora since F15, Debian, Ubuntu, etc.). It's a stable, actively maintained and developed open source project, with a supportive community, see the PCP Home Page and the project page on GITHUB and the git repo.
PCP Architecture

  • Metrics coverage - PCP captures just about everything exported by the Linux /proc and /sys pseudo filesystems, including per-process data. PCP is easily extensible and has 96 plugins available (at last count), e.g. Linux platform, Oracle, mmv (for Java/JVM), libvirt, container support and many many others, with new agents being added in every release.
  • Namespace and metadata - PCP has a uniform hierarchical metrics namespace with metadata for all metrics - data type (integer/string/block/event), semantics (counter/instant/etc), scaling units (time/count/space dimensions), help text and so forth. Generic monitoring tools can use this meta-data to do appropriate scaling, formatting and rate conversion of metric data, which mostly eliminates the need for custom monitoring applications. See for example pmrep(1) and pmchart(1) and other tools. For users familiar with legacy tools such as sysstat, top and so forth, PCP includes front-end tools with similar usage, see How does Performance Co-Pilot (PCP) compare with sysstat and the Side-by-side comparison of PCP tools with legacy tools.
  • Distributed Operation and Archive capture - PCP can export live data over secure IPC channels and also has logger services to capture archives for offsite or retrospective analysis. In essence, PCP has complete separation of the performance data capture (live or archive capture) from it's replay (live local, live remote, or archive replay). All PCP monitoring tools can do both live monitoring and archive replay and share common command line options, see PCPINTRO(1). This allows users to query live data (local or remote hosts), as well as traverse historical data for statistical reduction over whatever time period and sampling interval is required. PCP archives are also automatically included in sos-reports.
  • APIs - PCP has shipped in RHEL since RHEL6.6 and is included in all versions of RHEL7 and RHEL8. There are well documented, stable plugin and data-ingest libraries for developing new agents, and libraries for new monitoring tools. All APIs have C/C++, Python and Perl bindings. A Go binding is available too (search for PCP speed on github). There is also a REST API for web applications and a data-source plugin named grafana-pcp that allows PCP data to be visualized with Grafana.
  • Tools - PCP boasts a large variety of tools, with many console tools compatible with legacy sysstat tools (e.g. iostat, pidstat, vmstat, mpstat, uptime, free, atop, numastat, etc), as well as many generic tools for data probing and scripting. There are also GUI tools for advanced interactive analysis and report generation.
  • Security - PCP can be configured for localhost only (unix domain only with no inet domain sockets), or fully remote. When used for remote access, the PCP service daemons and libraries also support certificate based authentication with SSL encrypted IPC channels. A dedicated proxy daemon is also available for access through firewalls.
  • Documentation - PCP has extensive documentation, including two books and numerous white papers, HOWTOs, KCS solutions and articles (such as this one), etc. See references listed below.
  • Support - PCP is fully supported in RHEL. You can open a case to get assistance from Red Hat support, or you can engage the community directly.

STORAGE ANALYSIS TUTORIAL

We install and set up PCP with the logging services enabled on a small server (4 cores/4 disks/8G RAM). We also set up a fake workload and then capture some performance logs over several days. We then demonstrate basic storage performance analysis tasks using various PCP monitoring tools in archive log replay mode.
If you just want to learn about PCP monitoring tools without setting up a server to grab perf stats, then the PCP archives used in this tutorial are available for download:

# wget http://people.redhat.com/~mgoodwin/pcp/archives/goody.tgz
# tar xf goody.tgz

After unpacking the archive tarball, you can skip directly to Setup - PCP monitoring / analysis environment below.

Setup - Server Environment

QUICK START - Install the pcp-zeroconf package.
This will install, configure and enable PCP data collection as a standard pre-configured deployment suitable for most servers.

# dnf install pcp-zeroconf

For details, see - Installing and using the pcp-zeroconf package for Performance Co-Pilot.

The pcp-zeroconf package has dependencies on the PCP base package, non-gui CLI tools (in the pcp-system-tools package) and numerous packages containing the most commonly needed and essential PMDA agents, as well as the man pages in the pcp-doc package. The post-install scriptlets in the pcp-zeroconf package automatically set up the most common PCP server deployment for capturing PCP archives containing performance data.

DETAILS - Customized Installation

If additional customization is required, beyond what pcp-zeroconf sets up, install and configure PCP according to the documented KCS solutions for this task - this is applicable for any RHEL server from RHEL6.6 thru RHEL8.x. Packages are also available in all the major distros, Fedora, Debian, Ubuntu, etc. PCP is officially included in RHEL6.6 onwards.
- How do I install Performance Co-Pilot (PCP) on my RHEL server to capture performance logs

Enable the standard pmcd and pmlogger services and tweak the default logging interval, reducing it from 30 seconds down to 5 seconds:
- How can I change the default logging interval used by Performance Co-Pilot (PCP)?

Enable logging of per-process metrics with a 30 second interval for later replay with pcp-atop(1) and other PCP monitoring tools. This step isn't always needed for I/O performance analysis, but sometimes it's useful to check which processes are active, doing I/O and using other system resources:
- How can I customize the Performance Co-Pilot logging configuration

The resulting configuration captures roughly 0.5 GB/day and saves it in PCP archive logs below the standard PCP logging directory: /var/log/pcp. Recent versions of sos(1) include this directory (if present) in a standard sosreport. If an older version of sos(1) is in use, customers can manually create a gzipped tar file and upload it to the case:
- How do I gather performance data logs to upload to my support case using Performance Co-Pilot (PCP)

Setup - Fake Workload

For the following tutorial, we deploy a "fake" workload via a crontab entry to simulate backups running at 8pm every day for approximately two hours. This is just so we can demonstrate the monitoring tools by providing something to look for, i.e. a heavy read-only load on the storage, occurring daily. We use a fake workload because this KCS article is public, and hence we can't use any real customer data. A typical customer case might complain about "High I/O wait and poor server performance every evening, causing poor application performance on local servers and often machines in other timezones" .. sound familiar?

# 
# cat /etc/cron.d/fakebackups 
# Fake workload - every day at 8pm, pretend to run backups of /home
0  20  *  *  *  root  dd if=/dev/mapper/vg-home bs=1M of=/dev/null >/dev/null 2>&1

Setup - PCP monitoring / analysis environment

Install the pcp-system-tools package (at least) on your workstation. Also install pcp-gui for the pmchart charting tool and pcp-doc - you'll need the man pages too. These packages are NOT needed on the server - you only need the monitoring tools on the system you're intending to use for the analysis. Since PCP is portable and has both forward/backward version compatibility with complete separation of data capture from replay, the performance logs can be captured on the server of interest and then analysed later on any system with PCP monitoring tools installed. This applies to live monitoring and also to retrospective analysis from previously captured logs.

# dnf -y install pcp-system-tools pcp-doc

Wait for a while - then examine the PCP archives

After the PCP pmlogger service has been running for a few hours, days or the performance problem of interest has been reproduced, whatever - the resulting logs can be uploaded to the case or copied to your workstation:
- How do I gather performance data logs to upload to my support case using Performance Co-Pilot (PCP)
Note that PCP archives compress very well and the daily log rotation scripts do this automatically to minimize disk space. See the man page for pmlogger_daily(1) for details. Once unpacked, the logs for our fake customer server will be in a subdirectory, one for each host being logged - in the simple case, only one host is logged. In our case, the host is called goody and the logs are in /var/log/pcp/pmlogger/goody. After two days of logging, that directory contains the following files:

# ls -hl /var/log/pcp/pmlogger/goody
total 1.4G
-rw-r--r-- 1 pcp pcp 472M Jul 19 12:48 20160718.12.38.0
-rw-r--r-- 1 pcp pcp  86K Jul 19 12:48 20160718.12.38.index
-rw-r--r-- 1 pcp pcp 3.2M Jul 19 12:47 20160718.12.38.meta
-rw-r--r-- 1 pcp pcp 474M Jul 20 13:04 20160719.12.55.0
-rw-r--r-- 1 pcp pcp  86K Jul 20 13:05 20160719.12.55.index
-rw-r--r-- 1 pcp pcp 9.2M Jul 20 13:01 20160719.12.55.meta
-rw-r--r-- 1 pcp pcp 418M Jul 21 10:18 20160720.13.25.0
-rw-r--r-- 1 pcp pcp  74K Jul 21 10:18 20160720.13.25.index
-rw-r--r-- 1 pcp pcp 4.2M Jul 21 10:15 20160720.13.25.meta
-rw-r--r-- 1 pcp pcp  216 Jul 20 13:25 Latest
-rw-r--r-- 1 pcp pcp  11K Jul 20 13:25 pmlogger.log
-rw-r--r-- 1 pcp pcp  11K Jul 20 13:05 pmlogger.log.prior

Note that in this example, the archive files have already been decompressed and are ready for use in the tutorial below. PCP supports transparent archive decompression, so it it is strictly not necessary to decompress. However, the tools operate faster if the archives are already decompressed for obvious reasons.

PCP logs are rolled over daily by the pmlogger_daily(1) service under the control of a system(1) timer. Each archive basename uses the date and time the archive was created in the format YYMMDD.HH.MM., and consists of at least three files: a temporal index (.index), meta data (.meta), and one or more data volumes (.0, `.1', etc). In the above listing, we can see there are daily logs for July 18, 19 and 20th. Note that when a PCP archive name is supplied as the argument to a monitoring tool, only the archive basename should be supplied. PCP client tools can also replay all archives for a particular host just by naming the directory.

We can examine the archive label to determine the time bounds using the pmdumplog -L command, e.g.

# cd /var/log/pcp/pmlogger/goody
# pmdumplog -L 20160718.12.38
Log Label (Log Format Version 2)
Performance metrics from host goody
  commencing Mon Jul 18 12:38:49.106 2016
  ending     Tue Jul 19 12:48:44.134 2016
Archive timezone: AEST-10
PID for pmlogger: 12976

Other files in the pmlogger/goody directory are
- pmlogger.log - this is the log for the pmlogger service - it details which metrics are being logged and at what logging interval. Different groups of metrics can be logged at different rates. In our fake customer site case, we're logging most data at 5 seconds and per-process data at 30 second intervals. The pmlogger.log file also provides an indication of the approximate logging data volume per day - this is useful when configuring the logging interval and for /var filesystem capacity planning.

Timezones and the -z flag (important)

Note in the pmdumplog -L listing above, the timezone of the server is listed in the log label header. When using a monitoring tool to replay a PCP archive, the default reporting timezone is the local timezone of the host the monitoring tool is being run on. This is because many PCP monitoring tools can replay multiple archives, which may have been captured on hosts in different timezones around the world. In practice, when replaying an archive from a customer site, we are usually more interested in the customer's timezone and this is often different to your timezone. To use the timezone of the server rather than your local timezone (if they are different), use the -z flag with the monitoring tool you're using - that way your analysis will make the most sense when reported to the customer and the timestamps reported in syslog and other logs will correlate correctly. You can also use the -Z zone flag to specify that the reporting timezone to use is zone, e.g. -Z EDT or -Z PST, -Z UTC, etc. In general, you should pretty much always use the -z flag (perhaps it should be the default).

Merging multiple archives into one with pmlogextract

Often you will want to use a monitoring tool to examine several days or even weeks of performance data - as a summary and overview before zooming in to smaller time periods of interest. This is achieved easily using the pmlogextract tool to merge the archives into one larger archive, which you can then use with the -a flag passed to the monitoring tool :
- How can I merge multiple PCP archives into one for a system level performance analysis covering multiple days or weeks
For example:

# pwd
/var/log/pcp/pmlogger/goody
# pmlogextract *.0 ../myarchive
# pmdumplog -L ../myarchive
Log Label (Log Format Version 2)
Performance metrics from host goody
  commencing Mon Jul 18 12:38:49.106 2016
  ending     Thu Jul 21 08:13:12.404 2016
Archive timezone: AEST-10
PID for pmlogger: 22379

As you can see, myarchive spans several days; this will allow us to examine the data over several days before zooming in on times of interest - a very useful feature. The pmlogextract tool can also extract only certain metrics if needed, see the man page. Also note that recent versions of PCP (pcp-3.11.2 and later) allow the -a flag to be specified more than once, and the tool will transparently merge the archives "on the fly". In addition, as mentioned above, the argument to the -a flag can be a directory name, containing multiple archives. This is particularly handy for merging "on the fly", but note all archives in the same directory should be for one host only.

Using pmchart for a performance overview spanning days

So we can now use myarchive report a performance summary with any PCP monitoring tool. For example, let's just examine CPU usage over the three days to see if there's any excessive I/O Wait time (idle time with pending I/O):

# pmchart -z -a myarchive -t 10m -O-0 -s 400 -v 400 -geometry 800x450 -c CPU -c Loadavg

So that's a 10 minute sampling interval (-t 10m) with 400 intervals (-s 400 -v 400) and thus spanning (10 x 400 / 60) = 66 hours, or just under 3 days. We've used the timezone of the sever (-z), offset to the end of the archive (-O-0 that's -O negative zero, meaning zero seconds before the end of the archive), 800x450 window size, and to pre-load two pmchart "views" (-c CPU -c Loadavg) :
CPU and Loadavg summary over 3 days
Clearly there are daily load average spikes at 8pm each evening along with elevated I/O wait time (aqua in the CPU chart at the top) - in this simplistic example, you could upload that image to the case and ask the customer what applications run at 8pm every evening ... "oh, yeah the backups! case dismissed" :)

Using pmrep for text based output:

# pmrep -z -a myarchive -t 10m -p -S'@19:45' -s 20 kernel.all.load kernel.all.cpu.{user,sys,wait.total} disk.all.total_bytes
          k.a.load  k.a.load  k.a.load  k.a.c.user  k.a.c.sys  k.a.c.w.total  d.a.total_bytes
          1 minute  5 minute  15 minut                                                       
                                              util       util           util          Kbyte/s
19:45:00     0.010     0.020     0.050         N/A        N/A            N/A              N/A
19:55:00     0.010     0.020     0.050       0.012      0.004          0.009           12.507
20:05:00     1.100     0.690     0.330       0.016      0.055          0.292        44104.872
20:15:00     1.010     0.990     0.670       0.012      0.088          0.754        79864.410
20:25:00     1.050     1.060     0.860       0.012      0.074          0.908        66039.610
20:35:00     1.060     1.070     0.960       0.015      0.074          0.913        63837.190
20:45:00     1.310     1.160     1.020       0.012      0.070          0.792        62741.337
20:55:00     1.000     1.030     1.030       0.012      0.070          0.792        61975.538
21:05:00     1.000     1.060     1.060       0.016      0.071          0.882        60047.950
21:15:00     1.000     1.020     1.050       0.013      0.065          0.907        56705.562
21:25:00     1.000     1.020     1.050       0.012      0.060          0.907        52059.462
21:35:00     1.030     1.040     1.050       0.016      0.057          0.921        46970.013
21:45:00     1.020     1.110     1.090       0.012      0.047          0.960        41684.985
21:55:00     0.130     0.680     0.920       0.013      0.035          0.769        29428.372
22:05:00     0.000     0.130     0.510       0.016      0.007          0.013           25.687
22:15:00     0.000     0.020     0.270       0.012      0.004          0.010           13.060
22:25:00     0.000     0.010     0.150       0.014      0.006          0.012           44.713
22:35:00     0.000     0.010     0.080       0.017      0.008          0.010           17.532
22:45:00     0.030     0.030     0.050       0.012      0.004          0.010           13.690
22:55:00     0.000     0.010     0.050       0.012      0.005          0.009           14.037

In the above command, we've used pmrep to show timestamps (-p) with an update interval of 10 minutes (-t 10m), for 20 samples (-s 20), starting at 19:45pm (-S@19:45) and we've asked for the load averages (kernel.all.load), user, sys and wait CPU measurements (kernel.all.cpu.{user,sys,wait.total} and the total disk thruput (disk.all.total_bytes). So that's reporting the first I/O spike in the pmchart graph, on Jul 18th at around 8pm, which lasted about two hours.

Using pmiostat for storage performance analysis

The pmiostat tool is a CLI tool that can replay PCP archive logs. It's also known as pcp-iostat. The pmiostat report is very similar to it's namesake from sysstat(1), but more powerful in many ways.
Here is a similar example to the pmrep usage above - we're using pmiostat on the same archive, but we've chosen the second I/O spike on July 19th, and only asked for the report for the sda disk (using -R sda$ as a regex pattern) :

# pmiostat -z -a myarchive -x t -t 10m -S'@Jul 19 19:55:00' -s 10 -P0 -R 'sda$'
# Timestamp              Device       rrqm/s wrqm/s   r/s  w/s  rkB/s  wkB/s avgrq-sz avgqu-sz await r_await w_await %util
Tue Jul 19 20:05:00 2016 sda               0      0     0    2      0      5      2.7      0.0     7       8       7     1
Tue Jul 19 20:15:00 2016 sda               0      0     0    1      0      3      2.0      0.0     4      10       4     1
Tue Jul 19 20:25:00 2016 sda               0      0     0    1      0      3      2.0      0.0     4       0       4     1
Tue Jul 19 20:35:00 2016 sda               1      0   419    2  53299      4    126.8      1.6     4       4      26    81
Tue Jul 19 20:45:00 2016 sda               0      0    32    1   4108      3    121.5      0.1     4       4       6     8
Tue Jul 19 20:55:00 2016 sda               0      0     0    2      0      4      2.4      0.0     6      14       6     1
Tue Jul 19 21:05:00 2016 sda               0      0     0    2      0      5      2.5      0.0     5      11       5     1
Tue Jul 19 21:15:00 2016 sda               0      0     0    1      0      3      2.0      0.0     4       0       4     1
Tue Jul 19 21:25:00 2016 sda               0      0     0    2      0      4      2.4      0.0     6      15       6     1
Tue Jul 19 21:35:00 2016 sda               0      0     0    1      0      3      2.2      0.0     5       0       5     1

The columns in the above report are:

Column Name Description
rrqm/s The number of read requests expressed as a rate per-second that were merged during the reporting interval by the I/O scheduler.
wrqm/s The number of write requests expressed as a rate per-second that were merged during the reporting interval by the I/O scheduler.
r/s The number of read requests completed by the device (after merges), expressed as a rate per second during the reporting interval.
w/s The number of write requests completed by the device (after merges), expressed as a rate per second during the reporting interval.
rkB/s The average volume of data read from the device expressed as KBytes/second during the reporting interval.
wkB/s The average volume of data written to the device expressed as KBytes/second during the reporting interval.
avgrq-sz The average I/O request size for both reads and writes to the device expressed as Kbytes during the reporting interval.
avgqu-sz The average queue length of read and write requests to the device during the reporting interval.
await The average time in milliseconds that read and write requests were queued (and serviced) to the device during the reporting interval.
r_await The average time in milliseconds that read requests were queued (and serviced) to the device during the reporting interval.
w_await The average time in milliseconds that write requests were queued (and serviced) to the device during the reporting interval.
%util The percentage of time during the reporting interval that the device was busy processing requests. A value of 100% indicates device saturation.

Check what device-mapper volumes we have :

# pminfo -a myarchive -f hinv.map.dmname

hinv.map.dmname
    inst [0 or "vg-swap"] value "dm-0"
    inst [1 or "vg-root"] value "dm-1"
    inst [3 or "vg-libvirt"] value "dm-2"
    inst [5 or "vg-home"] value "dm-3"
    inst [7 or "docker-253:1-4720837-pool"] value "dm-4"
    inst [273 or "docker-253:1-4720837-bd20b3551af42d2a0a6a3ed96cccea0fd59938b70e57a967937d642354e04859"] value "dm-5"
    inst [274 or "docker-253:1-4720837-b4aca4676507814fcbd56f81f3b68175e0d675d67bbbe371d16d1e5b7d95594e"] value "dm-6"

The hinv.map.dmname metric is pretty useful, not just for listing the persistent LVM logical volume names, but also for mapping these to the current dm-X name used by the kernel - the latter are not persistent and may change across reboots and even when devices are removed and new devices created (see more on this below).

With pmiostat, the -x dm flag says to examine device-mapper statistics. Ok, let's use -x dm to check the logical volume stats, using the -R flag to only include logical volume names beginning with vg-, i.e. excluding the docker volumes:

# pmiostat -z -a myarchive -x t -x dm -t 10m -S'@Jul 19 19:55:00' -s 5 -P0 -R'vg-*'
# Timestamp              Device       rrqm/s wrqm/s   r/s  w/s  rkB/s  wkB/s avgrq-sz avgqu-sz await r_await w_await %util
Tue Jul 19 20:05:00 2016 vg-home           0      0   347    0  44025      0    126.8      1.0     3       3       0    50
Tue Jul 19 20:05:00 2016 vg-libvirt        0      0     0    2      0      5      2.6      0.0     7       8       7     1
Tue Jul 19 20:05:00 2016 vg-root           0      0     0    2      5     12      9.7      0.1    70      33      74     2
Tue Jul 19 20:05:00 2016 vg-swap           0      0     0    0      0      0      4.1      0.0    25      25       0     0
Tue Jul 19 20:15:00 2016 vg-home           0      0   631    0  79965      0    126.7      1.9     3       3       0   100
Tue Jul 19 20:15:00 2016 vg-libvirt        0      0     0    1      0      3      2.0      0.0     4      10       4     1
Tue Jul 19 20:15:00 2016 vg-root           0      0     0    1      4     10     10.6      0.1    66      41      73     3
Tue Jul 19 20:15:00 2016 vg-swap           0      0     0    0      0      0      4.0      0.0    68      68       0     0
Tue Jul 19 20:25:00 2016 vg-home           0      0   522    0  65960      0    126.4      2.0     4       4       0   100
Tue Jul 19 20:25:00 2016 vg-libvirt        0      0     0    1      0      3      2.0      0.0     4       0       4     1
Tue Jul 19 20:25:00 2016 vg-root           0      0     1    1     28     12     14.7      0.2    56      30      77     5
Tue Jul 19 20:25:00 2016 vg-swap           0      0     0    0      0      0      4.0      0.0    59      59       0     0
Tue Jul 19 20:35:00 2016 vg-home           0      0   501    0  63784      0    127.3      2.0     4       4       0   100
Tue Jul 19 20:35:00 2016 vg-libvirt        0      0     0    2      0      4      2.5      0.0    28      37      28     3
Tue Jul 19 20:35:00 2016 vg-root           0      0     1    2     12     12     11.3      0.1    55      34      62     2
Tue Jul 19 20:35:00 2016 vg-swap           0      0     0    0      0      0      4.0      0.0    33      33       0     0
Tue Jul 19 20:45:00 2016 vg-home           0      0   485    0  61965      0    127.8      2.0     4       4       0   100
Tue Jul 19 20:45:00 2016 vg-libvirt        0      0     0    1      0      3      2.2      0.0     7      12       7     1
Tue Jul 19 20:45:00 2016 vg-root           0      0     0    1      4     10     11.6      0.0    15      17      15     1
Tue Jul 19 20:45:00 2016 vg-swap           0      0     0    0      0      0      4.0      0.0    13      13       0     0

So, with our fake workload, it looks like the vg-home LVM logical volume has all the traffic and the others are relatively idle. Let's zoom in on that volume only using the -R flag:

# pmiostat -z -a myarchive -x t -x dm -t 10m -S'@Jul 19 19:45:00' -s 15 -P0 -R'vg-home'
# Timestamp              Device       rrqm/s wrqm/s   r/s  w/s  rkB/s  wkB/s avgrq-sz avgqu-sz await r_await w_await %util
Tue Jul 19 19:55:00 2016 vg-home           0      0     0    0      0      0      0.0      0.0     0       0       0     0
Tue Jul 19 20:05:00 2016 vg-home           0      0   347    0  44025      0    126.8      1.0     3       3       0    50
Tue Jul 19 20:15:00 2016 vg-home           0      0   631    0  79965      0    126.7      1.9     3       3       0   100
Tue Jul 19 20:25:00 2016 vg-home           0      0   522    0  65960      0    126.4      2.0     4       4       0   100
Tue Jul 19 20:35:00 2016 vg-home           0      0   501    0  63784      0    127.3      2.0     4       4       0   100
Tue Jul 19 20:45:00 2016 vg-home           0      0   485    0  61965      0    127.8      2.0     4       4       0   100
Tue Jul 19 20:55:00 2016 vg-home           0      0   479    0  61362      0    128.0      1.9     4       4       0   100
Tue Jul 19 21:05:00 2016 vg-home           0      0   467    0  59751      0    128.0      2.0     4       4       0   100
Tue Jul 19 21:15:00 2016 vg-home           0      0   440    0  56380      0    128.0      2.0     4       4       0   100
Tue Jul 19 21:25:00 2016 vg-home           0      0   405    0  51875      0    128.0      2.0     5       5       0   100
Tue Jul 19 21:35:00 2016 vg-home           0      0   367    0  46996      0    128.0      2.0     5       5       0   100
Tue Jul 19 21:45:00 2016 vg-home           0      0   326    0  41727      0    128.0      2.0     6       6       0   100
Tue Jul 19 21:55:00 2016 vg-home           0      0   242    0  31030      0    128.0      1.7     7       7       0    85
Tue Jul 19 22:05:00 2016 vg-home           0      0     0    0      0      0      0.0      0.0     0       0       0     0
Tue Jul 19 22:15:00 2016 vg-home           0      0     0    0      0      0      0.0      0.0     0       0       0     0

Checking the sos-report, the vg volume group uses the following physvols :

# 
# pvs
  PV         VG   Fmt  Attr PSize   PFree
  /dev/sda1  vg   lvm2 a--  232.88g    0 
  /dev/sdb2  vg   lvm2 a--  230.88g    0 
  /dev/sdc   vg   lvm2 a--  232.88g    0 

Note there is currently no PCP metric for mapping PVs to LVs, but it would be handy! Similarly, there is no mapping metric for listing device-mapper multipath alternate paths - that would be quite handy too. Such a metric, e.g. perhaps called hinv.map.multipath would map scsi device paths to WWID, which is a many:one mapping in a multipath environment.

So let's check those individual scsi devices with pmiostat:

# pmiostat -z -a myarchive -x t -t 10m -S'@Jul 19 19:45:00' -s 10 -P0 -R'sd[abc]'
# Timestamp              Device       rrqm/s wrqm/s   r/s  w/s  rkB/s  wkB/s avgrq-sz avgqu-sz await r_await w_await %util
Tue Jul 19 19:55:00 2016 sda               0      0     0    1      0      3      2.0      0.0     5       0       5     1
Tue Jul 19 19:55:00 2016 sdb               0      0     0    1      1     10     10.8      0.0     8      12       7     1
Tue Jul 19 19:55:00 2016 sdc               0      0     0    0      0      0      0.0      0.0     0       0       0     0
Tue Jul 19 20:05:00 2016 sda               0      0     0    2      0      5      2.7      0.0     7       8       7     1
Tue Jul 19 20:05:00 2016 sdb               0      0   347    1  44030     12    126.4      1.1     3       3      78    50
Tue Jul 19 20:05:00 2016 sdc               0      0     0    0      0      0      0.0      0.0     0       0       0     0
Tue Jul 19 20:15:00 2016 sda               0      0     0    1      0      3      2.0      0.0     4      10       4     1
Tue Jul 19 20:15:00 2016 sdb               1      0   630    1  79970     10    126.7      2.0     3       3      66   100
Tue Jul 19 20:15:00 2016 sdc               0      0     0    0      0      0      0.0      0.0     0       0       0     0
Tue Jul 19 20:25:00 2016 sda               0      0     0    1      0      3      2.0      0.0     4       0       4     1
Tue Jul 19 20:25:00 2016 sdb               1      0   522    1  65988     12    126.2      2.1     4       4      77   100
Tue Jul 19 20:25:00 2016 sdc               0      0     0    0      0      0      0.0      0.0     0       0       0     0
Tue Jul 19 20:35:00 2016 sda               1      0   419    2  53299      4    126.8      1.6     4       4      26    81
Tue Jul 19 20:35:00 2016 sdb               0      0    82    1  10496     12    125.5      0.5     6       5      69    19
Tue Jul 19 20:35:00 2016 sdc               0      0     0    0      0      0      0.0      0.0     0       0       0     0
Tue Jul 19 20:45:00 2016 sda               0      0    32    1   4108      3    121.5      0.1     4       4       6     8
Tue Jul 19 20:45:00 2016 sdb               0      0     0    1      4     10     12.8      0.0    17      16      17     1
Tue Jul 19 20:45:00 2016 sdc               0      0   452    0  57858      0    127.9      1.8     4       4       0    93
Tue Jul 19 20:55:00 2016 sda               0      0     0    2      0      4      2.4      0.0     6      14       6     1
Tue Jul 19 20:55:00 2016 sdb               0      0     0    1      3     10     12.1      0.0    24      21      24     1
Tue Jul 19 20:55:00 2016 sdc               0      0   479    0  61362      0    128.0      1.9     4       4       0   100
Tue Jul 19 21:05:00 2016 sda               0      0     0    2      0      5      2.5      0.0     5      11       5     1
Tue Jul 19 21:05:00 2016 sdb               0      0     0    1      7     12     10.5      0.1    34      20      38     1
Tue Jul 19 21:05:00 2016 sdc               0      0   467    0  59751      0    128.0      2.0     4       4       0   100
Tue Jul 19 21:15:00 2016 sda               0      0     0    1      0      3      2.0      0.0     4       0       4     1
Tue Jul 19 21:15:00 2016 sdb               0      0     0    1      2     10     12.0      0.0    19      15      19     1
Tue Jul 19 21:15:00 2016 sdc               0      0   440    0  56380      0    128.0      2.0     4       4       0   100
Tue Jul 19 21:25:00 2016 sda               0      0     0    2      0      4      2.4      0.0     6      15       6     1
Tue Jul 19 21:25:00 2016 sdb               0      0     4    1    150     11     32.4      0.0     9       3      31     1
Tue Jul 19 21:25:00 2016 sdc               0      0   405    0  51875      0    128.0      2.0     5       5       0   100

Note that the physvols in the vg volume group are linear, and we see traffic on sdb, then sda and then sdc become busy in turn, as our fake workload progresses in reading the three physvols comprising the vg-home volume group. If we instead had a stripe, we'd see simultaneous traffic on all three devices. We can visualize this more clearly with pmchart, e.g. :
pmchart per-disk spike on July 19th

Here's the pmchart IOSTAT view:
pmchart IOSTAT view for the whole archive - traffic tab

And here's the other pmchart tab showing the other iostat columns:
pmchart IOSTAT view for the whole archive - iostat tab

pmiostat - device selection and aggregation operators

A useful feature in pmiostat (version 3.11.2 or later) is the -R regex and -G sum|avg|min|max command line options. The regex selects a subset of the available devices (either sd devices, or device-mapper logical devices if the -x dm flag has been given). Only the selected devices will be reported, e.g.

# pmiostat -z -a myarchive -x t -t 10m -S'@Jul 19 19:45:00' -s 4 -P0 -R'sd[ab]$'
# Timestamp              Device       rrqm/s wrqm/s   r/s  w/s  rkB/s  wkB/s avgrq-sz avgqu-sz await r_await w_await %util
Tue Jul 19 19:55:00 2016 sda               0      0     0    1      0      3      2.0      0.0     5       0       5     1
Tue Jul 19 19:55:00 2016 sdb               0      0     0    1      1     10     10.8      0.0     8      12       7     1
Tue Jul 19 20:05:00 2016 sda               0      0     0    2      0      5      2.7      0.0     7       8       7     1
Tue Jul 19 20:05:00 2016 sdb               0      0   347    1  44030     12    126.4      1.1     3       3      78    50
Tue Jul 19 20:15:00 2016 sda               0      0     0    1      0      3      2.0      0.0     4      10       4     1
Tue Jul 19 20:15:00 2016 sdb               1      0   630    1  79970     10    126.7      2.0     3       3      66   100
Tue Jul 19 20:25:00 2016 sda               0      0     0    1      0      3      2.0      0.0     4       0       4     1
Tue Jul 19 20:25:00 2016 sdb               1      0   522    1  65988     12    126.2      2.1     4       4      77   100

Note the extended regex needs to be anchored with a trailing $ - otherwise we'll match sdaa, sdab, etc.
The -G flag allows aggregation of the statistics for the devices matching the regex supplied with the -R flag, or all devices by default if -R isn't supplied. As a practical example, we can check the HBTL scsi-id by querying the hinv.map.scsi metric, e.g. :

# pminfo -a myarchive -f hinv.map.scsi

hinv.map.scsi
    inst [0 or "scsi0:0:0:0 Direct-Access"] value "sda"
    inst [1 or "scsi1:0:0:0 Direct-Access"] value "sdb"
    inst [2 or "scsi2:0:0:0 Direct-Access"] value "sdc"
    inst [3 or "scsi3:0:0:0 Direct-Access"] value "sdd"

And let's say we want to sum the statistics for sda and sdb (e.g. because they might be alternate paths to the same target, or on the same PCI bus, whatever - and you're interested in summing that traffic) :

# pmiostat -z -a myarchive -x t -t 10m -S'@Jul 19 19:45:00' -s 4 -P0 -R'sd[ab]$' -Gsum
# Timestamp              Device       rrqm/s wrqm/s   r/s  w/s  rkB/s  wkB/s avgrq-sz avgqu-sz await r_await w_await %util
Tue Jul 19 19:55:00 2016 sum(sd[ab]$)      0      1     0    2      1     13     12.8      0.0    12      12      12     1
Tue Jul 19 20:05:00 2016 sum(sd[ab]$)      0      1   347    3  44030     17    129.1      1.1    10      11      85    51
Tue Jul 19 20:15:00 2016 sum(sd[ab]$)      1      0   630    2  79970     13    128.7      2.0     7      13      70   101
Tue Jul 19 20:25:00 2016 sum(sd[ab]$)      1      1   522    3  65988     15    128.2      2.1     8       4      80   101

Note that the device name column is now reported as the supplied regex pattern and that the %util column is always averaged, despite using the -Gsum flag.

dm-XX to logical volume name mapping

Internally, the kernel allocates device mapper devices dynamically and names them dm-XX. The mapping is dynamic and can change across reboots and a particular device mapper device name can be reused (if deleted and then a new device created). Kernel syslog messages usually report the dm-XX names, not the persistent logical device names used by LVM. So correlating the dynamic names used in kernel messages with LVM persistent names can be particularly arduous. Fortunately, PCP supports the hinv.map.dmname metric and we can use it in pmrep reports just like any other metric. Here is an example of the mapping :

# pminfo -a myarchive -f hinv.map.dmname

hinv.map.dmname
    inst [0 or "vg-swap"] value "dm-0"
    inst [1 or "vg-root"] value "dm-1"
    inst [3 or "vg-libvirt"] value "dm-2"
    inst [5 or "vg-home"] value "dm-3"
    inst [7 or "docker-253:1-4720837-pool"] value "dm-4"
    inst [273 or "docker-253:1-4720837-bd20b3551af42d2a0a6a3ed96cccea0fd59938b70e57a967937d642354e04859"] value "dm-5"
    inst [274 or "docker-253:1-4720837-b4aca4676507814fcbd56f81f3b68175e0d675d67bbbe371d16d1e5b7d95594e"] value "dm-6"

pmrep - other examples

When using pmrep, the hinv.map.dmname can be listed in it's own column, e.g.

# pmrep -a myarchive -X "Device Name" -z -g -t 10m -p -f%c -S'@19:45' -s 5 -i 'vg-.*' hinv.map.dmname disk.dm.read_bytes disk.dm.write_bytes
[ 1] - hinv.map.dmname["vg-swap"] - none
[ 1] - hinv.map.dmname["vg-root"] - none
[ 1] - hinv.map.dmname["vg-libvirt"] - none
[ 1] - hinv.map.dmname["vg-home"] - none
[ 2] - disk.dm.read_bytes["vg-swap"] - Kbyte/s
[ 2] - disk.dm.read_bytes["vg-root"] - Kbyte/s
[ 2] - disk.dm.read_bytes["vg-libvirt"] - Kbyte/s
[ 2] - disk.dm.read_bytes["vg-home"] - Kbyte/s
[ 3] - disk.dm.write_bytes["vg-swap"] - Kbyte/s
[ 3] - disk.dm.write_bytes["vg-root"] - Kbyte/s
[ 3] - disk.dm.write_bytes["vg-libvirt"] - Kbyte/s
[ 3] - disk.dm.write_bytes["vg-home"] - Kbyte/s

                                              1         2         3
Mon Jul 18 19:45:00 2016      vg-swap      dm-0       N/A       N/A
Mon Jul 18 19:45:00 2016      vg-root      dm-1       N/A       N/A
Mon Jul 18 19:45:00 2016   vg-libvirt      dm-2       N/A       N/A
Mon Jul 18 19:45:00 2016      vg-home      dm-3       N/A       N/A
Mon Jul 18 19:55:00 2016      vg-swap      dm-0     0.187     0.000
Mon Jul 18 19:55:00 2016      vg-root      dm-1     0.000     9.617
Mon Jul 18 19:55:00 2016   vg-libvirt      dm-2     0.000     2.703
Mon Jul 18 19:55:00 2016      vg-home      dm-3     0.000     0.000
Mon Jul 18 20:05:00 2016      vg-swap      dm-0     0.500     0.000
Mon Jul 18 20:05:00 2016      vg-root      dm-1     5.037    11.875
Mon Jul 18 20:05:00 2016   vg-libvirt      dm-2     0.020     5.337
Mon Jul 18 20:05:00 2016      vg-home      dm-3  44082.53     0.000
Mon Jul 18 20:15:00 2016      vg-swap      dm-0     0.240     0.000
Mon Jul 18 20:15:00 2016      vg-root      dm-1     8.888     9.778
Mon Jul 18 20:15:00 2016   vg-libvirt      dm-2     0.007     3.007
Mon Jul 18 20:15:00 2016      vg-home      dm-3  79842.49     0.000
Mon Jul 18 20:25:00 2016      vg-swap      dm-0     0.190     0.000
Mon Jul 18 20:25:00 2016      vg-root      dm-1    31.063    11.827
Mon Jul 18 20:25:00 2016   vg-libvirt      dm-2     0.013     3.093
Mon Jul 18 20:25:00 2016      vg-home      dm-3  65993.43     0.000

Note above we've used the -X flag with the heading "Device Name" to transpose the report (swap columns with rows). We've used the -g flag to list the details of each column heading before the main report. We've specified the -z flag to timestamps are in the timezone of the archive and -f%c to specify we want full ctime(3) timestamps rather than the default abbreviated version. We've also used the -i flag to restrict the report to only instances matching vg-.*. And finally, we've included both the dm-X and persistent LV names as separate columns in the report.

Another really handy feature of pmrep is pre-configured report columns. These are defined in pmrep.conf (usually found in /etc/pcp/pmrep/pmrep.conf). For example, the pre-configured :vmstat report can be combined with an additional column to show total disk traffic as follows:

# pmrep -a myarchive -z -t 10m -p -S'@19:45' -s 5 disk.all.total_bytes :vmstat
         d.a.total_bytes   r   b    swpd     free    buff   cache    si    so     bi     bo    in     cs  us  sy  id  wa  st
19:45:00             N/A   1   0 1460764  1261236 1968296 3504052   N/A   N/A    N/A    N/A   N/A    N/A N/A N/A N/A N/A N/A
19:55:00              13   1   0 1460664  1255956 1969544 3507748     0     0      0     12   283    387   0   0  99   0   0
20:05:00           44105   2   0 1460448    51976 6138692  540784     0     0  44088     18   848   1260   0   1  90   7   0
20:15:00           79864   1   1 1460300    51196 6229432  451360     0     0  79851     12  1336   1901   0   2  78  18   0
20:25:00           66040   1   2 1460180    50864 6226808  453452     0     0  66024     15  1333   1656   0   2  74  22   0

Notice the :vmstat specification is prefixed with :. There are many pre-defined report configuration entries in pmrep.conf - take a look. You can also define your own, as described in the pmrep(1) man page.

Non-interpolated mode

By default, all PCP monitoring tools use an interpolation mechanism, to allow the reporting interval to differ from the underlying native sampling interval. For details on how this works, see PCPIntro(1). Sometimes we do not want to use interpolation, e.g. when correlating blocktrace(1) data with pmiostat data, we will need to match timestamps exactly. To disable interpolation, use the -u flag with pmrep or pmiostat.

Note that the -u flag is incompatible with the -t flag - when not using interpolation the sampling interval cannot be changed, it's always the same as the underlying sampling interval that the data was originally recorded with.

Tools for reporting per-process I/O

The pcp-atop(1) command can be used to examine per-process I/O statistics, and a wealth of other system level performance data. Like all PCP tools, pcp-atop can replay archives, provided the appropriate pmlogconf settings have been enabled - see above. For per-process I/O analysis, use the pcp -a ARCHIVE atop -d flag. You probably also want to specify a fairly long sampling interval because atop runs fairly slowly on large archives, especially when proc data is present. There are a large number of options. For process I/O, the -d flag is most useful.

# pcp atop --help
Usage: pcp-atop [-flags] [interval [samples]]
        or
Usage: pcp-atop -w  file  [-S] [-a] [interval [samples]]
       pcp-atop -r  file  [-b hh:mm] [-e hh:mm] [-flags]

    generic flags:
      -a  show or log all processes (i.s.o. active processes only)
      -R  calculate proportional set size (PSS) per process
      -P  generate parseable output for specified label(s)
      -L  alternate line length (default 80) in case of non-screen output
      -f  show fixed number of lines with system statistics
      -F  suppress sorting of system resources
      -G  suppress exited processes in output
      -l  show limited number of lines for certain resources
      -y  show individual threads
      -1  show average-per-second i.s.o. total values

      -x  no colors in case of high occupation
      -g  show general process-info (default)
      -m  show memory-related process-info
      -d  show disk-related process-info
      -n  show network-related process-info
      -s  show scheduling-related process-info
      -v  show various process-info (ppid, user/group, date/time)
      -c  show command line per process
      -o  show own defined process-info
      -u  show cumulated process-info per user
      -p  show cumulated process-info per program (i.e. same name)

      -C  sort processes in order of cpu-consumption (default)
      -M  sort processes in order of memory-consumption
      -D  sort processes in order of disk-activity
      -N  sort processes in order of network-activity
      -A  sort processes in order of most active resource (auto mode)

    specific flags for raw logfiles:
      -w  write raw data to PCP archive folio
      -r  read  raw data from PCP archive folio
      -S  finish pcp-atop automatically before midnight (i.s.o. #samples)
      -b  begin showing data from specified time
      -e  finish showing data after specified time

    interval: number of seconds   (minimum 0)
    samples:  number of intervals (minimum 1)

Finding blocked processes

One common task in I/O performance analysis involves blocked tasks, i.e. processes in 'D' state or UN uninterruptible sleep. It's easy to check this with pmchart by plotting the proc.runq.runnable and proc.runq.blocked process counts, e.g.

# pmchart -z -a myarchive -t 10m -O-0 -s 400 -v 400 -geometry 800x450 -c CPU

pmchart running and blocked processes

To monitor specific processes, you can also use the pmrep and pcp-pidstat tools, and others. The pcp-pidstat tool is currently being developed with additional features to detect and monitor blocked processes.

References

Knowledge Base Articles and Solutions

White Papers, Guide Books and Case Studies and Presentations

1 Comments

Thank you Mark and team for this awesome work!