9.5. Viewing Storage Node Alerts

The storage nodes offer an elastic way of storing metrics information, which allows more metrics to be collected more frequently. Maintaining the right number of storage nodes for a given infrastructure, resource count, and metrics volume requires ongoing awareness of how the current nodes are performing, which indicates whether additional storage nodes are needed.
There are two primary indications that an additional metrics storage node should be deployed or that metrics collection schedules should be adjusted:
  • The JVM heap hits its maximum threshold and causes performance degradation.
  • The storage node begins using too much disk space on its system.
The heap size (along with other JVM parameters) is configurable per node, and the disk space limits are relative to the system, but in large environments or with intensive metrics collection, the storage node could still encounter hardware limits.
The storage node resource does not have metrics or alert definitions configured directly on it, as other resources do. Instead, there are four alerts pre-defined for every storage node, to help determine through automatic monitoring when to add nodes:
  • High heap usage, which can lead to out of memory errors and performance degradation. A dampening rule is in place to prevent alerts for momentary memory spikes.
  • High disk usage, which can lead to problems with compaction and other routine operations.
    Compaction operations are particularly important because compaction merges data files on the disk into a single disk file. This frees disk space and improves read performance. If this operation fails, then performance can degrade.
    An alert is fired for high disk usage if any one of several conditions is met:
    • The size of the storage node data exceeds 50% of total disk space.
    • The overall amount of disk space used exceeds 75% of the total disk space (regardless of how much disk space the storage node is using).
    • The ratio of free disk space to storage node data is less than 1.5. This is calculated by taking the amount of free disk space divided by the disk space used by the storage node. If there is 50MB of free space, and the storage node is using 35MB of disk, then the ratio is 50/35, or approximately 1.43. That is too low and would trigger an alert.
    A dampening rule is in place to prevent alerts for momentary usage spikes.
  • Snapshot failure, meaning a local routine backup operation has failed.
  • Maintenance operation failure, meaning either a deploy or undeploy operation for a node failed. Any underlying causes, like an unavailable resource, can be addressed and then the operation can be re-run.
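As an illustration, the three high disk usage conditions above can be sketched in Python. The function name, parameter names, and structure below are assumptions for illustration only, not the product's actual implementation:

```python
def high_disk_usage(node_data_mb, used_mb, total_mb):
    """Return True if any of the three high-disk-usage conditions is met.

    node_data_mb -- disk space used by the storage node's data
    used_mb      -- total disk space used on the system (all processes)
    total_mb     -- total disk space available on the system
    """
    free_mb = total_mb - used_mb
    # Condition 1: storage node data exceeds 50% of total disk space.
    if node_data_mb > 0.50 * total_mb:
        return True
    # Condition 2: overall disk usage exceeds 75% of total disk space.
    if used_mb > 0.75 * total_mb:
        return True
    # Condition 3: ratio of free space to storage node data is below 1.5.
    # For example, 50MB free and 35MB of node data gives 50/35, or about
    # 1.43, which is below the threshold and triggers an alert.
    if free_mb / node_data_mb < 1.5:
        return True
    return False
```

For example, `high_disk_usage(35, 50, 100)` triggers: the node data is under 50% of the disk and overall usage is under 75%, but the free-space-to-data ratio of 50/35 falls below 1.5.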
Each of the predefined alerts is set against child resources for the storage node.

Table 9.1. Storage Resources for Alerts

Alert                         | Parent Resource              | Resource Type    | Area to Address
High Heap Usage               | Cassandra Server JVM         | Memory Subsystem | Edit the heap sizes in the storage node JVM configuration
High Disk Usage               | Database Management Services | Storage Service  | Increase the disk space for the system hosting the node
Snapshot Failure              | Database Management Services | Storage Service  |
Maintenance Operation Failure |                              | Storage Node     | Unavailable storage nodes in the cloud (which prevent updates)
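The dampening rule mentioned for the heap and disk usage alerts can be sketched roughly as follows. This is a minimal illustration assuming a fixed consecutive-sample window; the actual evaluation logic in the product may differ:

```python
def dampened_alert(samples, required=3):
    """Fire only when the alert condition holds for `required`
    consecutive samples, so a momentary spike is ignored."""
    streak = 0
    for condition_met in samples:
        streak = streak + 1 if condition_met else 0
        if streak >= required:
            return True
    return False

# A brief spike does not fire, but a sustained breach does.
spike = dampened_alert([True, False, True, False])      # no alert
sustained = dampened_alert([True, True, True, False])   # alert fires
```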
To view the alerts for a storage cluster:
  1. Click the Administration tab in the top navigation bar.
  2. In the Topology area on the left, select the Storage Nodes item.
  3. In the Nodes tab, check the number of unacknowledged alerts for each node.
  4. To view the list of alerts, open the Cluster Alerts tab.
    Every alert is listed with a description of the condition which triggered it, the affected resource, and the time of the alert.