Chapter 9. Monitoring
CloudForms 4.6 can monitor the health of an OpenShift Container Platform cluster by receiving alerts created by Prometheus. Conversely, if CloudForms is to provide inventory, utilization metrics, reporting and automation accurately and reliably, its own health should also be monitored.
9.1. Monitoring of OpenShift Container Platform
CloudForms 4.6 can receive and process alerts generated by Prometheus in an OpenShift Container Platform 3.7 and later cluster.
Note that Prometheus is supplied as a Technology Preview feature in OpenShift Container Platform 3.7 and 3.9, and that interaction with Prometheus monitoring is also a Technology Preview feature in CloudForms 4.6.
9.1.1. OpenShift Node Exporter
To be able to generate Prometheus alerts on node performance (such as CPU utilization), the node-exporter pod must run on each node. The following instructions describe how to create a new project for the node-exporter and install it onto each node:
oc adm new-project openshift-metrics-node-exporter \
    --node-selector='zone=default'

oc project openshift-metrics-node-exporter

oc create \
    -f https://raw.githubusercontent.com/openshift/origin/master/examples/prometheus/node-exporter.yaml \
    -n openshift-metrics-node-exporter

oc adm policy add-scc-to-user -z prometheus-node-exporter \
    -n openshift-metrics-node-exporter hostaccess

The label supplied with the --node-selector switch should be applied to all nodes in the cluster.
9.1.2. Defining Prometheus Alerts
Alerts can be defined in Prometheus by editing the prometheus ConfigMap in the openshift-metrics project (as long as Prometheus has been installed in OpenShift Container Platform):

oc project openshift-metrics
oc edit cm prometheus
Prometheus alerts to be consumed by CloudForms are defined with a "target" of either a node or a provider (the miqTarget annotation). Alert rules are defined within the context of a named rule group; for example, the following alert in the custom-rules group will trigger if an OpenShift Container Platform node is down:
prometheus.rules: |
  groups:
  - name: custom-rules
    interval: 30s # defaults to global interval
    rules:
    - alert: "Node Down"
      expr: up{job="kubernetes-nodes"} == 0
      annotations:
        miqTarget: "ContainerNode"
        url: "https://www.example.com/node_down_fixing_instructions"
        description: "Node {{ $labels.instance }} is down"
      labels:
        severity: "error"

Each rule has a number of parameters, as follows:
- alert: - the name of the alert as seen in the Prometheus and CloudForms Monitoring WebUI consoles
- expr: - the PromQL expression to be evaluated to determine whether the alert should fire.[14]
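PromQL evaluation happens inside Prometheus itself, but the arithmetic behind a rate-based expression (such as the CPU-usage rule later in this section) can be sketched in Python. This is an illustration only, not CloudForms or Prometheus code; irate() takes the per-second slope of the last two samples of a counter:

```python
def irate(samples):
    """Per-second rate from the last two (timestamp, value) samples,
    mirroring PromQL's irate()."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    return (v2 - v1) / (t2 - t1)

# Hypothetical cumulative idle-CPU seconds for one node, sampled every 30s.
idle_samples = [(0, 100.0), (30, 115.0), (60, 121.0)]

idle_fraction = irate(idle_samples)          # (121 - 115) / 30 = 0.2
cpu_usage_pct = 100 - (idle_fraction * 100)  # 80.0% busy
alert_fires = cpu_usage_pct > 70             # the comparison holds, so the rule fires
```

In a real rule the result is averaged per instance (avg by (instance)), and the alert fires only while the comparison holds for the configured duration.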
9.1.2.1. Annotations
An alert can have several annotations, although the following three should always be defined:
- miqTarget: this should be "ContainerNode" or "ExtManagementSystem", and determines whether the alert should be associated with an individual node, or the entire OpenShift Container Platform provider.
- url: a URL to a web page containing a "Standard Operating Procedure" (SOP) describing how to fix the problem.
- description: a more complete description of the alert. The description text can include substitution variables that insert label values defined with the alert (for example {{ $labels.<label_name> }}), or the value of a counter (for example {{ $value }}).
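Prometheus expands these placeholders with Go's templating engine when the alert fires. Purely as an illustration, the substitution behaves like the following Python sketch (the label values shown are hypothetical):

```python
import re

def render(template, labels, value=None):
    # Replace {{ $labels.<name> }} and {{ $value }} the way Prometheus
    # alert annotation templating does (simplified sketch).
    out = re.sub(r"\{\{\s*\$labels\.(\w+)\s*\}\}",
                 lambda m: str(labels.get(m.group(1), "")), template)
    if value is not None:
        out = re.sub(r"\{\{\s*\$value\s*\}\}", str(value), out)
    return out

msg = render("Node {{ $labels.instance }} is down",
             {"instance": "node1.example.com"})
# msg == "Node node1.example.com is down"
```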
9.1.2.2. Labels
An alert can have several labels, although the following should always be defined:
- severity: the severity level of the alert within CloudForms. Valid levels are "error", "warning" or "info".
9.1.2.3. Triggering Prometheus to Reload its Configuration
Once the ConfigMap has been edited, the prometheus process in the Prometheus container must be sent a SIGHUP signal so that it re-reads the configuration changes. This can be done using the following commands:
oc rsh -c prometheus prometheus-0 bash
bash-4.2$ kill -SIGHUP 1
bash-4.2$ exit
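The reload relies on standard UNIX signal handling: PID 1 inside the container is the prometheus process, and SIGHUP tells it to re-read its configuration. The same pattern can be demonstrated with a self-contained Python sketch (illustrative only, POSIX systems):

```python
import os
import signal

config = {"version": 1}

def reload_config(signum, frame):
    # A real daemon would re-read its configuration file from disk here.
    config["version"] += 1

signal.signal(signal.SIGHUP, reload_config)

# Equivalent of running `kill -SIGHUP <pid>` against this process:
os.kill(os.getpid(), signal.SIGHUP)
# The handler has now run, so config["version"] == 2
```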
9.1.2.4. Example Alerts
The following alert definition is useful to test whether the alerting functionality has been configured correctly. It should raise an alert if any of the OpenShift Container Platform nodes are up:
- alert: "Node up" # helpful for testing
  expr: up{job="kubernetes-nodes"} == 1
  annotations:
    miqTarget: "ContainerNode"
    url: "https://www.example.com/fixing_instructions"
    description: "ContainerNode {{ $labels.instance }} is up"
  labels:
    severity: "info"

The following alert definition should raise an alert if the total number of pods running in the cluster exceeds the threshold defined in the PromQL expression:
- alert: "Too Many Pods"
  expr: sum(kubelet_running_pod_count) > 50
  annotations:
    miqTarget: "ExtManagementSystem"
    url: "https://www.example.com/too_many_pods_fixing_instructions"
    description: "Too many running pods"
  labels:
    severity: "error"

The following alert definition should raise an alert if the rate of authenticated logins to the OpenShift Container Platform console exceeds the threshold defined in the PromQL expression:
- alert: "Too Many Requests"
  expr: rate(authenticated_user_requests[2m]) > 20
  annotations:
    miqTarget: "ExtManagementSystem"
    url: "https://www.example.com/too_many_requests_fixing_instructions"
    description: "Too many authenticated login requests"
  labels:
    severity: "warning"

The following alert definition uses metrics returned by the node-exporter, and should raise an alert if node CPU utilization exceeds the threshold defined in the PromQL expression:
- alert: "Node CPU Usage"
  expr: (100 - (avg by (instance) (irate(node_cpu{app="prometheus-node-exporter", mode="idle"}[5m])) * 100)) > 70
  for: 30s
  annotations:
    miqTarget: "ExtManagementSystem"
    url: "https://www.example.com/high_node_cpu_fixing_instructions"
    description: "{{ $labels.instance }}: CPU usage is above 70% (current value is: {{ $value }})"
  labels:
    severity: "warning"

9.1.3. Alert Profiles
To enable CloudForms to consume and process the Prometheus alerts, a pre-defined Control Alert Profile must be assigned for each of the two alert target types (see Figure 9.1, “Prometheus Alert Profiles”).
Figure 9.1. Prometheus Alert Profiles

The "Prometheus Provider Profile" can be assigned to selected, tagged or all OpenShift Container Platform providers (see Figure 9.2, “Prometheus Provider Profile Assignments”).
Figure 9.2. Prometheus Provider Profile Assignments

The "Prometheus Node Profile" can be assigned to all OpenShift Container Platform nodes (see Figure 9.3, “Prometheus Node Profile Assignments”).
Figure 9.3. Prometheus Node Profile Assignments

Once the alert profiles have been assigned, any new Prometheus alerts will be visible in the CloudForms Monitor → Alerts WebUI page (see Figure 9.4, “Prometheus Alerts Visible in CloudForms”).
Figure 9.4. Prometheus Alerts Visible in CloudForms

9.2. Monitoring of CloudForms
A large CloudForms deployment has many interacting components: several appliances or pods, many worker processes, a message queue, and a PostgreSQL database.
The VMDB and CFME worker appliances or pods within a region have different monitoring requirements, as described below.
9.2.1. Database Appliance or Pod
The PostgreSQL database can become a performance bottleneck for the CloudForms region if it is not performing optimally. The following items should be regularly monitored:
- VMDB disk space utilization - monitor and forecast when 80% of the filesystem will become filled, and track actual disk consumption against expected consumption
- CPU utilization - a steady-state utilization approaching 80% may indicate that VMDB appliance scaling or a region redesign is required
- Memory utilization, especially swap usage - increase appliance memory (or pod memory limits in a podified deployment) if swapping is occurring
- I/O throughput - use the sysstat or iotop tools to monitor I/O utilization, throughput, and I/O wait state processing
- The miq_queue table:
  - Number of entries
  - Signs of an event storm - a large number of messages with role = 'event' and class_name = 'EmsEvent'
  - Number of messages in a "ready" state
- Check that the maximum number of configured connections is not exceeded
- Ensure that the database maintenance scripts run regularly
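The miq_queue checks above lend themselves to automation. As a hedged sketch (the thresholds are arbitrary examples, and the rows are assumed to have already been fetched from the miq_queue table, which includes role, class_name and state columns), message counts could be classified like this:

```python
def queue_health(rows, event_storm_threshold=1000, ready_threshold=500):
    """Classify miq_queue rows to highlight the conditions listed above."""
    events = sum(1 for r in rows
                 if r["role"] == "event" and r["class_name"] == "EmsEvent")
    ready = sum(1 for r in rows if r["state"] == "ready")
    return {
        "event_messages": events,
        "ready_messages": ready,
        "possible_event_storm": events > event_storm_threshold,
        "ready_backlog": ready > ready_threshold,
    }

# Two sample rows standing in for the result of a SELECT from miq_queue:
rows = [
    {"role": "event", "class_name": "EmsEvent", "state": "ready"},
    {"role": "smartstate", "class_name": "Job", "state": "dequeue"},
]
report = queue_health(rows, event_storm_threshold=0, ready_threshold=1)
```

With the deliberately low example thresholds, the single EmsEvent message is enough to flag a possible event storm, while the "ready" backlog flag stays clear.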
9.2.2. CFME 'Worker' Appliances or Pods
Operational limits for "worker" appliances are usually established on a per-appliance basis, and depend on the enabled server roles and number of worker processes. Resource limits for the "cloudforms-backend" StatefulSet replicas must be defined in the YAML definition for the StatefulSet, and so should be set at the maximum level that any of the replicas will require.
The following items should typically be monitored:
9.2.2.1. General Appliance/Pod
- CPU and memory utilization
- Check for message timeouts
9.2.2.2. Workers
- Review rates and reasons for worker process restarts
- Increase allocated memory if workers are exceeding memory thresholds
- Validate that the primary/secondary roles for workers in zones and region are as expected, and force a role failover if necessary
9.2.2.2.1. Provider Refresh
Review EMS refresh activity:

- How many refreshes per hour?
- How long does a refresh take per OpenShift Container Platform cluster?
  - Data extraction component
  - Database load component
- Are refresh times consistent throughout the day?
  - What is causing periodic slowdowns?
  - Are certain property changes triggering too many refreshes?
9.2.2.2.2. Capacity & Utilization
Are any realtime metrics being lost?

- Long message dequeue times
- Missing data samples

How long does metric collection take?

- Data extraction component
- Database load component

Are rollups completing in time?

- Confirm expected daily and hourly records for each container, pod, etc.
- Validate the numbers of Data Collector and Data Processor workers
9.2.2.2.3. Automate
Are any requests staying in a "pending" state for a long time?
- Validate the number of Generic workers
- Check for state machine retries or timeouts exceeded
9.2.2.2.4. Event Handling
- Monitor the utilization of CFME appliances or pods with the Event Monitor role enabled
- Validate the memory allocated to Event Monitor workers
9.2.2.2.5. SmartState Analysis
- Monitor utilization of CFME appliances or pods with the SmartProxy role enabled when scheduled scans are running
- Review scan failures or aborts
- Validate the number of SmartProxy workers
9.2.2.2.6. Reporting
- Monitor utilization of appliances with the Reporting role enabled when periodic reports are running
- Validate the number of Reporting workers
9.2.3. Control Alerts
Some self-protection policies are available out-of-the-box in the form of control alerts. The following CFME Operation alert types are available:
9.2.3.1. Server Alerts
- EVM Server Database Backup Insufficient Space
- EVM Server Exceeded Memory Limit
- EVM Server High /boot Disk Usage
- EVM Server High /home Disk Usage
- EVM Server High /tmp Disk Usage
- EVM Server High /var Disk Usage
- EVM Server High /var/log Disk Usage
- EVM Server High /var/log/audit Disk Usage
- EVM Server High DB Disk Usage
- EVM Server High System Disk Usage
- EVM Server High Temp Storage Disk Usage
- EVM Server Not Responding
- EVM Server Start
- EVM Server Stop
- EVM Server is Master
9.2.3.2. Worker Alerts
- EVM Worker Exceeded Memory Limit
- EVM Worker Exceeded Uptime Limit
- EVM Worker Exit File
- EVM Worker Killed
- EVM Worker Not Responding
- EVM Worker Started
- EVM Worker Stopped
Each alert type can be configured to send an email or an SNMP trap, or to run an Automate instance (see Figure 9.5, “Defining a "Server Stopped" Alert”).
Figure 9.5. Defining a "Server Stopped" Alert

EVM Worker Exceeded Uptime Limit, EVM Worker Started, and EVM Worker Stopped events are normal occurrences and should not be considered cause for alarm.
An email sent by one of these alerts will have a subject such as:
Alert Triggered: EVM Worker Killed, for (MIQSERVER) cfmesrv06.
The email body will contain text such as the following:
Alert 'EVM Worker Killed', triggered
Event: Alert condition met
Entity: (MiqServer) cfmesrv06
To determine more information, such as the actual worker type that was killed, it may be necessary to search evm.log on the appliance mentioned.
9.3. Consolidated Logging
The distributed nature of the worker/message architecture means that it is often impossible to predict which CFME appliance will run a particular action. This can add to the troubleshooting challenge of examining log files, as the correct appliance hosting the relevant log file must first be located.
For the podified deployment of CloudForms, the logs are written in JSON format to STDOUT in the containers, so Fluentd running on the OpenShift Container Platform nodes will forward them to Elasticsearch automatically (if EFK has been deployed in the cluster).
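The convention of one JSON object per line on STDOUT is what makes the Fluentd pickup automatic. As an illustration only (the field names here are arbitrary examples, not the exact CloudForms log schema):

```python
import json
import sys
from datetime import datetime, timezone

def log_json(level, message, **fields):
    """Emit one JSON object per line on STDOUT, ready for Fluentd pickup."""
    record = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **fields,
    }
    line = json.dumps(record)
    sys.stdout.write(line + "\n")
    return line

entry = log_json("info", "worker started", worker_type="MiqGenericWorker")
```

Each emitted line is independently parseable, so a log forwarder can treat STDOUT as a stream of structured records rather than free text.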
Although there is no out-of-the-box consolidated logging architecture for the VM appliance version of CloudForms 4.6, it is possible to add CloudForms logs as a source to an external ELK/EFK stack. This can bring a number of benefits, and greatly simplifies the task of log searching in a CloudForms deployment comprising many CFME appliances.
