2.4. What to Monitor?

As stated earlier, the resources present in every system are CPU power, bandwidth, memory, and storage. At first glance, it would seem that monitoring need consist only of examining these four things.
Unfortunately, it is not that simple. For example, consider a disk drive. What things might you want to know about its performance?
  • How much free space is available?
  • How many I/O operations on average does it perform each second?
  • How long on average does it take each I/O operation to be completed?
  • How many of those I/O operations are reads? How many are writes?
  • What is the average amount of data read/written with each I/O?
There are more ways of studying disk drive performance; these points have only scratched the surface. The main concept to keep in mind is that there are many different types of data for each resource.
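As a rough illustration of how several of the questions above relate to one another, the following sketch derives them from two snapshots of cumulative disk counters taken some seconds apart. The DiskCounters structure, its field names, and the disk_metrics function are hypothetical and chosen only for this example; real systems expose similar counters under their own names.

```python
from dataclasses import dataclass


@dataclass
class DiskCounters:
    """Cumulative counters for one disk (hypothetical layout for illustration)."""
    reads: int          # read operations completed so far
    writes: int         # write operations completed so far
    bytes_read: int     # total bytes read so far
    bytes_written: int  # total bytes written so far
    io_time_ms: int     # total milliseconds spent servicing I/O operations


def disk_metrics(before, after, interval_s):
    """Derive per-interval statistics from two counter snapshots."""
    reads = after.reads - before.reads
    writes = after.writes - before.writes
    ops = reads + writes
    nbytes = (after.bytes_read - before.bytes_read) + (
        after.bytes_written - before.bytes_written
    )
    io_ms = after.io_time_ms - before.io_time_ms
    return {
        # I/O operations performed per second
        "iops": ops / interval_s,
        # Average time taken by each I/O operation
        "avg_ms_per_io": io_ms / ops if ops else 0.0,
        # Share of operations that were reads (the rest were writes)
        "read_fraction": reads / ops if ops else 0.0,
        # Average amount of data moved per I/O operation
        "avg_bytes_per_io": nbytes / ops if ops else 0.0,
    }
```

Working from the difference between two snapshots, rather than from a single reading, is what turns ever-growing totals into rates and averages for the interval of interest.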
The following sections explore the types of utilization information that would be helpful for each of the major resource types.

2.4.1. Monitoring CPU Power

In its most basic form, monitoring CPU power can be no more difficult than determining if CPU utilization ever reaches 100%. If CPU utilization stays below 100%, no matter what the system is doing, there is additional processing power available for more work.
However, it is a rare system that does not reach 100% CPU utilization at least some of the time. At that point it is important to examine more detailed CPU utilization data. By doing so, it becomes possible to start determining where the majority of your processing power is being consumed. Here are some of the more popular CPU utilization statistics:
User Versus System
The percentage of time spent performing user-level processing versus system-level processing can point out whether a system's load is primarily due to running applications or due to operating system overhead. High user-level percentages tend to be good (assuming users are not experiencing unsatisfactory performance), while high system-level percentages tend to point toward problems that will require further investigation.
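The user-versus-system split can be sketched as follows, assuming a Linux system, where /proc/stat exposes cumulative CPU time counters on its aggregate "cpu" line. The function names are hypothetical, and the choice to count interrupt-handling time (irq and softirq) as system-level time is an assumption made for this example.

```python
# Field order of the aggregate "cpu" line in Linux /proc/stat.
# Older kernels report fewer fields; missing ones are treated as zero.
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from /proc/stat-style text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:1 + len(CPU_FIELDS)]]
            values += [0] * (len(CPU_FIELDS) - len(values))
            return dict(zip(CPU_FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def user_system_split(before_text, after_text):
    """Percentage of CPU time spent at user and system level between snapshots."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    delta = {name: after[name] - before[name] for name in CPU_FIELDS}
    total = sum(delta.values())
    if total == 0:
        return {"user_pct": 0.0, "system_pct": 0.0}
    user = delta["user"] + delta["nice"]
    # Assumption: time spent servicing interrupts is counted as system-level.
    system = delta["system"] + delta["irq"] + delta["softirq"]
    return {
        "user_pct": 100.0 * user / total,
        "system_pct": 100.0 * system / total,
    }

# On Linux, a snapshot could be taken with: open("/proc/stat").read()
```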
Context Switches
A context switch happens when the CPU stops running one process and starts running another. Because each context switch requires the operating system to take control of the CPU, excessive context switches and high levels of system-level CPU consumption tend to go together.
Interrupts
As the name implies, interrupts are situations where the processing being performed by the CPU is abruptly changed. Interrupts generally occur due to hardware activity (such as an I/O device completing an I/O operation) or due to software (such as software interrupts that control application processing). Because interrupts must be serviced at a system level, high interrupt rates lead to higher system-level CPU consumption.
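Context switch and interrupt rates can be obtained in a similar way. The sketch below again assumes Linux, where /proc/stat carries a "ctxt" line (total context switches since boot) and an "intr" line whose first number is the total count of interrupts serviced. The helper names are hypothetical.

```python
def read_counter(stat_text, name):
    """Return the first numeric value on the line that starts with `name`."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return int(fields[1])
    raise KeyError(name)


def rates_per_second(before_text, after_text, interval_s):
    """Context switches and interrupts per second between two snapshots."""
    ctxt = read_counter(after_text, "ctxt") - read_counter(before_text, "ctxt")
    intr = read_counter(after_text, "intr") - read_counter(before_text, "intr")
    return {
        "context_switches": ctxt / interval_s,
        "interrupts": intr / interval_s,
    }
```

Watching these two rates alongside the system-level CPU percentage can help show whether a rise in system time coincides with heavier switching or interrupt activity.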
Runnable Processes
A process may be in different states. For example, it may be:
  • Waiting for an I/O operation to complete
  • Waiting for the memory management subsystem to handle a page fault
In these cases, the process has no need for the CPU.
Eventually, however, the process state changes, and the process becomes runnable. As the name implies, a runnable process is one that is capable of getting work done as soon as it is scheduled to receive CPU time. If more than one process is runnable at any given time, all but one[4] of the runnable processes must wait for their turn at the CPU. By monitoring the number of runnable processes, it is possible to determine how CPU-bound your system is.
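The relationship between runnable processes and available processors can be sketched as follows. This assumes Linux, where /proc/stat reports the number of currently runnable processes on its "procs_running" line; the function names are hypothetical, and the default of one processor mirrors the single-processor assumption made above.

```python
def runnable_processes(stat_text):
    """Return the runnable process count from /proc/stat-style text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "procs_running":
            return int(fields[1])
    raise ValueError("no 'procs_running' line found")


def waiting_for_cpu(runnable, cpus=1):
    """Number of runnable processes that must wait for their turn at a CPU."""
    return max(0, runnable - cpus)
```

A value from waiting_for_cpu that stays well above zero over many samples suggests a CPU-bound system, while a value that is usually zero suggests that processing power is not the limiting resource.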
Other performance metrics that reflect an impact on CPU utilization tend to concern the different services the operating system provides to processes. They may include statistics on memory management, I/O processing, and so on. These statistics also reveal that, when system performance is monitored, there are no boundaries between the different statistics. In other words, CPU utilization statistics may end up pointing to a problem in the I/O subsystem, or memory utilization statistics may reveal an application design flaw.
Therefore, when monitoring system performance, it is not possible to examine any one statistic in complete isolation; only by examining the overall picture is it possible to extract meaningful information from any performance statistics you gather.


[4] Assuming a single-processor computer system.