4.2. Disk

The following sections showcase scripts that monitor disk and I/O activity.

4.2.1. Summarizing Disk Read/Write Traffic

This section describes how to identify which processes are performing the heaviest disk reads/writes to the system.

Example 4.9. disktop.stp

#!/usr/bin/stap
#
# Copyright (C) 2007 Oracle Corp.
#
# Get the status of reading/writing disk every 5 seconds,
# output top ten entries 
#
# This is free software,GNU General Public License (GPL);
# either version 2, or (at your option) any later version.
#
# Usage:
#  ./disktop.stp
#

global io_stat,device
global read_bytes,write_bytes

probe vfs.read.return {
  if ($return>0) {
    if (devname!="N/A") {/*skip read from cache*/
      io_stat[pid(),execname(),uid(),ppid(),"R"] += $return
      device[pid(),execname(),uid(),ppid(),"R"] = devname
      read_bytes += $return
    }
  }
}

probe vfs.write.return {
  if ($return>0) {
    if (devname!="N/A") { /*skip update cache*/
      io_stat[pid(),execname(),uid(),ppid(),"W"] += $return
      device[pid(),execname(),uid(),ppid(),"W"] = devname
      write_bytes += $return
    }
  }
}

probe timer.ms(5000) {
  /* skip non-read/write disk */
  if (read_bytes+write_bytes) {

    printf("\n%-25s, %-8s%4dKb/sec, %-7s%6dKb, %-7s%6dKb\n\n",
           ctime(gettimeofday_s()),
           "Average:", ((read_bytes+write_bytes)/1024)/5,
           "Read:",read_bytes/1024,
           "Write:",write_bytes/1024)

    /* print header */
    printf("%8s %8s %8s %25s %8s %4s %12s\n",
           "UID","PID","PPID","CMD","DEVICE","T","BYTES")
  }
  /* print top ten I/O */
  foreach ([process,cmd,userid,parent,action] in io_stat- limit 10)
    printf("%8d %8d %8d %25s %8s %4s %12d\n",
           userid,process,parent,cmd,
           device[process,cmd,userid,parent,action],
           action,io_stat[process,cmd,userid,parent,action])

  /* clear data */
  delete io_stat
  delete device
  read_bytes = 0
  write_bytes = 0  
}

probe end{
  delete io_stat
  delete device
  delete read_bytes
  delete write_bytes
}
Example 4.9, “disktop.stp” outputs the top ten processes responsible for the heaviest reads or writes to a disk. Example 4.10, “Example 4.9, “disktop.stp” Sample Output” displays a sample output for this script and includes the following data per listed process:
  • UID — user ID. A user ID of 0 refers to the root user.
  • PID — the ID of the listed process.
  • PPID — the process ID of the listed process's parent process.
  • CMD — the name of the listed process.
  • DEVICE — which storage device the listed process is reading from or writing to.
  • T — the type of action performed by the listed process; W refers to write, and R refers to read.
  • BYTES — the amount of data read to or written from disk.
The time and date in the output of Example 4.9, “disktop.stp” is returned by the functions ctime() and gettimeofday_s(). ctime() derives calendar time in terms of seconds passed since the start of the Unix time (January 1, 1970). gettimeofday_s() counts the actual number of seconds since the start of the Unix time, which gives a fairly accurate human-readable timestamp for the output.
In this script, the $return is a local variable that stores the actual number of bytes each process reads or writes from the virtual file system. $return can only be used in return probes (for example, vfs.read.return and vfs.read.return).

Example 4.10. Example 4.9, “disktop.stp” Sample Output

[...]
Mon Sep 29 03:38:28 2008 , Average:  19Kb/sec, Read: 7Kb, Write: 89Kb
UID      PID     PPID                       CMD   DEVICE    T    BYTES
0    26319    26294                   firefox     sda5    W        90229
0     2758     2757           pam_timestamp_c     sda5    R         8064
0     2885        1                     cupsd     sda5    W         1678
Mon Sep 29 03:38:38 2008 , Average:   1Kb/sec, Read: 7Kb, Write: 1Kb
UID      PID     PPID                       CMD   DEVICE    T    BYTES
0     2758     2757           pam_timestamp_c     sda5    R         8064
0     2885        1                     cupsd     sda5    W         1678

4.2.2. Tracking I/O Time For Each File Read or Write

This section describes how to monitor the amount of time it takes for each process to read from or write to any file. This is useful if you wish to determine what files are slow to load on a given system.

Example 4.11. iotime.stp

global start
global entry_io
global fd_io
global time_io

function timestamp:long() {
  return gettimeofday_us() - start
}

function proc:string() {
  return sprintf("%d (%s)", pid(), execname())
}

probe begin {
 start = gettimeofday_us()
}

global filenames
global filehandles
global fileread
global filewrite

probe syscall.open {
  filenames[pid()] = user_string($filename)
} 

probe syscall.open.return {
  if ($return != -1) {
    filehandles[pid(), $return] = filenames[pid()]
    fileread[pid(), $return] = 0
    filewrite[pid(), $return] = 0
  } else {
    printf("%d %s access %s fail\n", timestamp(), proc(), filenames[pid()])
  }
  delete filenames[pid()]
}

probe syscall.read {
  if ($count > 0) {
    fileread[pid(), $fd] += $count
  }
  t = gettimeofday_us(); p = pid()
  entry_io[p] = t
  fd_io[p] = $fd
}

probe syscall.read.return {
  t = gettimeofday_us(); p = pid()
  fd = fd_io[p]
  time_io[p,fd] <<< t - entry_io[p]
}

probe syscall.write {
  if ($count > 0) {
    filewrite[pid(), $fd] += $count
  }
  t = gettimeofday_us(); p = pid()
  entry_io[p] = t
  fd_io[p] = $fd
}

probe syscall.write.return {
  t = gettimeofday_us(); p = pid()
  fd = fd_io[p]
  time_io[p,fd] <<< t - entry_io[p]
}

probe syscall.close {
  if (filehandles[pid(), $fd] != "") {
    printf("%d %s access %s read: %d write: %d\n",  timestamp(), proc(),
           filehandles[pid(), $fd], fileread[pid(), $fd], filewrite[pid(), $fd])
    if (@count(time_io[pid(), $fd]))
      printf("%d %s iotime %s time: %d\n",  timestamp(), proc(),
             filehandles[pid(), $fd], @sum(time_io[pid(), $fd]))
   }
  delete fileread[pid(), $fd]
  delete filewrite[pid(), $fd]
  delete filehandles[pid(), $fd]
  delete fd_io[pid()]
  delete entry_io[pid()]
  delete time_io[pid(),$fd]
}
Example 4.11, “iotime.stp” tracks each time a system call opens, closes, reads from, and writes to a file. For each file any system call accesses, Example 4.11, “iotime.stp” counts the number of microseconds it takes for any reads or writes to finish and tracks the amount of data (in bytes) read from or written to the file.
Example 4.11, “iotime.stp” also uses the local variable $count to track the amount of data (in bytes) that any system call attempts to read or write. Note that $return (as used in Example 4.9, “disktop.stp” from Section 4.2.1, “Summarizing Disk Read/Write Traffic”) stores the actual amount of data read/written. $count can only be used on probes that track data reads or writes (e.g. syscall.read and syscall.write).

Example 4.12. Example 4.11, “iotime.stp” Sample Output

[...]
825946 3364 (NetworkManager) access /sys/class/net/eth0/carrier read: 8190 write: 0
825955 3364 (NetworkManager) iotime /sys/class/net/eth0/carrier time: 9
[...]
117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0
117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7
[...]
3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 0
3973744 2886 (sendmail) iotime /proc/loadavg time: 11
[...]
  • A timestamp, in microseconds.
  • Process ID and process name.
  • An access or iotime flag.
  • The file accessed.
If a process was able to read or write any data, a pair of access and iotime lines should appear together. The access line's timestamp refers to the time that a given process started accessing a file; at the end of the line, it will show the amount of data read/written (in bytes). The iotime line will show the amount of time (in microseconds) that the process took in order to perform the read or write.
If an access line is not followed by an iotime line, it simply means that the process did not read or write any data.

4.2.3. Track Cumulative IO

This section describes how to track the cumulative amount of I/O to the system.

Example 4.13. traceio.stp

#! /usr/bin/env stap
# traceio.stp
# Copyright (C) 2007 Red Hat, Inc., Eugene Teo <eteo@redhat.com>
# Copyright (C) 2009 Kai Meyer <kai@unixlords.com>
#   Fixed a bug that allows this to run longer
#   And added the humanreadable function
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
#

global reads, writes, total_io

probe vfs.read.return {
  reads[pid(),execname()] += $return
  total_io[pid(),execname()] += $return
}

probe vfs.write.return {
  writes[pid(),execname()] += $return
  total_io[pid(),execname()] += $return
}

function humanreadable(bytes) {
  if (bytes > 1024*1024*1024) {
    return sprintf("%d GiB", bytes/1024/1024/1024)
  } else if (bytes > 1024*1024) {
    return sprintf("%d MiB", bytes/1024/1024)
  } else if (bytes > 1024) {
    return sprintf("%d KiB", bytes/1024)
  } else {
    return sprintf("%d   B", bytes)
  }
}

probe timer.s(1) {
  foreach([p,e] in total_io- limit 10)
    printf("%8d %15s r: %12s w: %12s\n",
           p, e, humanreadable(reads[p,e]),
           humanreadable(writes[p,e]))
  printf("\n")
  # Note we don't zero out reads, writes and total_io,
  # so the values are cumulative since the script started.
}
Example 4.13, “traceio.stp” prints the top ten executables generating I/O traffic over time. In addition, it also tracks the cumulative amount of I/O reads and writes done by those ten executables. This information is tracked and printed out in 1-second intervals, and in descending order.
Note that Example 4.13, “traceio.stp” also uses the local variable $return, which is also used by Example 4.9, “disktop.stp” from Section 4.2.1, “Summarizing Disk Read/Write Traffic”.

Example 4.14. Example 4.13, “traceio.stp” Sample Output

[...]
           Xorg r:   583401 KiB w:        0 KiB
       floaters r:       96 KiB w:     7130 KiB
multiload-apple r:      538 KiB w:      537 KiB
           sshd r:       71 KiB w:       72 KiB
pam_timestamp_c r:      138 KiB w:        0 KiB
        staprun r:       51 KiB w:       51 KiB
          snmpd r:       46 KiB w:        0 KiB
          pcscd r:       28 KiB w:        0 KiB
     irqbalance r:       27 KiB w:        4 KiB
          cupsd r:        4 KiB w:       18 KiB
           Xorg r:   588140 KiB w:        0 KiB
       floaters r:       97 KiB w:     7143 KiB
multiload-apple r:      543 KiB w:      542 KiB
           sshd r:       72 KiB w:       72 KiB
pam_timestamp_c r:      138 KiB w:        0 KiB
        staprun r:       51 KiB w:       51 KiB
          snmpd r:       46 KiB w:        0 KiB
          pcscd r:       28 KiB w:        0 KiB
     irqbalance r:       27 KiB w:        4 KiB
          cupsd r:        4 KiB w:       18 KiB

4.2.4. I/O Monitoring (By Device)

This section describes how to monitor I/O activity on a specific device.

Example 4.15. traceio2.stp

#! /usr/bin/env stap

global device_of_interest

probe begin {
  /* The following is not the most efficient way to do this.
      One could directly put the result of usrdev2kerndev()
      into device_of_interest.  However, want to test out
      the other device functions */
  dev = usrdev2kerndev($1)
  device_of_interest = MKDEV(MAJOR(dev), MINOR(dev))
}

probe vfs.write, vfs.read
{
  if (dev == device_of_interest)
    printf ("%s(%d) %s 0x%x\n",
            execname(), pid(), probefunc(), dev)
}
Example 4.15, “traceio2.stp” takes 1 argument: the whole device number. To get this number, use stat -c "0x%D" directory, where directory is located on the device you wish to monitor.
The usrdev2kerndev() function converts the whole device number into the format understood by the kernel. The output produced by usrdev2kerndev() is used in conjunction with the MKDEV(), MINOR(), and MAJOR() functions to determine the major and minor numbers of a specific device.
The output of Example 4.15, “traceio2.stp” includes the name and ID of any process performing a read/write, the function it is performing (vfs_read or vfs_write), and the kernel device number.
The following example is an excerpt from the full output of stap traceio2.stp 0x805, where 0x805 is the whole device number of /home. /home resides in /dev/sda5, which is the device we wish to monitor.

Example 4.16. Example 4.15, “traceio2.stp” Sample Output

[...]
synergyc(3722) vfs_read 0x800005
synergyc(3722) vfs_read 0x800005
cupsd(2889) vfs_write 0x800005
cupsd(2889) vfs_write 0x800005
cupsd(2889) vfs_write 0x800005
[...]

4.2.5. Monitoring Reads and Writes to a File

This section describes how to monitor reads from and writes to a file in real time.

Example 4.17. inodewatch.stp

#! /usr/bin/env stap

probe vfs.write, vfs.read
{
  # dev and ino are defined by vfs.write and vfs.read
  if (dev == MKDEV($1,$2) # major/minor device
      && ino == $3)
    printf ("%s(%d) %s 0x%x/%u\n",
      execname(), pid(), probefunc(), dev, ino)
}
Example 4.17, “inodewatch.stp” takes the following information about the file as arguments on the command line:
  • The file's major device number.
  • The file's minor device number.
  • The file's inode number.
To get this information, use stat -c '%D %i' filename, where filename is an absolute path.
For instance, if you wish to monitor the /etc/crontab file, run stat -c '%D %i' /etc/crontab first. This outputs the following output:
805 1078319
805 is the base-16 (hexadecimal) device number. The lower two digits are the minor device number, and the upper digits are the major number. 1078319 is the inode number. To start monitoring /etc/crontab, run stap inodewatch.stp 0x8 0x05 1078319 (The 0x prefixes indicate base-16 values).
The output of this command contains the name and ID of any process performing a read/write, the function it is performing (vfs_read or vfs_write), the device number (in hex format), and the inode number. Example 4.18, “Example 4.17, “inodewatch.stp” Sample Output” contains the output of stap inodewatch.stp 0x8 0x05 1078319 (when cat /etc/crontab is executed while the script is running):

Example 4.18. Example 4.17, “inodewatch.stp” Sample Output

cat(16437) vfs_read 0x800005/1078319
cat(16437) vfs_read 0x800005/1078319

4.2.6. Monitoring Changes to File Attributes

This section describes how to monitor if any processes are changing the attributes of a targeted file, in real time.

Example 4.19. inodewatch2-simple.stp

global ATTR_MODE = 1

probe kernel.function("inode_setattr") {
  dev_nr = $inode->i_sb->s_dev
  inode_nr = $inode->i_ino

  if (dev_nr == ($1 << 20 | $2) # major/minor device
      && inode_nr == $3
      && $attr->ia_valid & ATTR_MODE)
    printf ("%s(%d) %s 0x%x/%u %o %d\n",
      execname(), pid(), probefunc(), dev_nr, inode_nr, $attr->ia_mode, uid())
}
Like Example 4.17, “inodewatch.stp” from Section 4.2.5, “Monitoring Reads and Writes to a File”, Example 4.19, “inodewatch2-simple.stp” takes the targeted file's device number (in integer format) and inode number as arguments. For more information on how to retrieve this information, see Section 4.2.5, “Monitoring Reads and Writes to a File”.
The output for Example 4.19, “inodewatch2-simple.stp” is similar to that of Example 4.17, “inodewatch.stp”, except that Example 4.19, “inodewatch2-simple.stp” also contains the attribute changes to the monitored file, as well as the ID of the user responsible (uid()). Example 4.20, “Example 4.19, “inodewatch2-simple.stp” Sample Output” shows the output of Example 4.19, “inodewatch2-simple.stp” while monitoring /home/joe/bigfile when user joe executes chmod 777 /home/joe/bigfile and chmod 666 /home/joe/bigfile.

Example 4.20. Example 4.19, “inodewatch2-simple.stp” Sample Output

chmod(17448) inode_setattr 0x800005/6011835 100777 500
chmod(17449) inode_setattr 0x800005/6011835 100666 500