Chapter 6. Capacity & Utilization
CloudForms 4.6 can retrieve Capacity & Utilization (C&U) metrics from either Hawkular or Prometheus endpoints in an OpenShift Container Platform environment. Metrics are retrieved for nodes, pods and containers.
C&U metrics collection of an OpenShift Container Platform cluster should not be deployed/enabled on CloudForms Management Engine (CFME) 5.9.0 or 5.9.1 if Hawkular is used as the OpenShift Container Platform metrics endpoint.
C&U metrics collection from Hawkular can be enabled on CFME 5.9.2 or later, but the :hawkular_force_legacy: value should be set to true in Configuration → Advanced settings, i.e.
:queue_worker_base:
:ems_metrics_collector_worker:
...
:ems_metrics_collector_worker_kubernetes:
...
:hawkular_force_legacy: true
It should also be noted that Prometheus is supplied as a Technology Preview feature in OpenShift Container Platform 3.7 and 3.9, and that interaction with Prometheus for metrics collection is also a CloudForms 4.6 Technology Preview feature.
6.1. Component Parts
As discussed in Chapter 2, CloudForms Architecture, there are three CFME appliance roles connected with C&U processing:
- Capacity & Utilization Coordinator
- Capacity & Utilization Data Collector
- Capacity & Utilization Data Processor
6.1.1. C&U Coordination
Every 3 minutes a message is queued for the C&U Coordinator to begin processing[11]. The Coordinator schedules the data collections for the VMDB objects that support metrics collection, and queues messages for the C&U Data Collector to retrieve metrics for any objects for which a collection is due. For an OpenShift Container Platform provider, metrics are retrieved for nodes, pods and containers.
The time interval between collections is defined by the :capture_threshold setting in the Configuration → Advanced settings, as follows:
:performance:
:capture_threshold:
:default: 10.minutes
:ems_cluster: 50.minutes
:host: 50.minutes
:storage: 60.minutes
:vm: 50.minutes
:container: 50.minutes
:container_group: 50.minutes
:container_node: 50.minutes
:capture_threshold_with_alerts:
:default: 1.minutes
:host: 20.minutes
:vm: 20.minutes
The intention is that no metric is ever older than the :capture_threshold value for its object type. The default :capture_threshold for OpenShift Container Platform pods, containers and nodes is 50 minutes, so on average messages for approximately 6% (100 / (50/3)) of the total number of these managed objects are created every 3 minutes.
As an example, if CloudForms is managing an OpenShift Container Platform cluster containing 1000 pods, 1000 containers and 20 nodes, there will be approximately 2020/16.666 or 121 messages created every 3 minutes. In practice the number varies slightly at each C&U Coordinator run, and for this example might range between approximately 110 and 130 as more or fewer objects fall outside the :capture_threshold timeout value.
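The arithmetic can be sanity-checked with a quick shell calculation. The figures below are simply the assumed values from the example above (2020 objects, a 50 minute :capture_threshold and a 3 minute coordinator interval), not values read from a live system:
# Expected ems_metrics_collector messages queued per coordinator run
# (example figures only: 2020 objects, 50 minute threshold, 3 minute interval)
objects=2020; threshold_mins=50; interval_mins=3
awk -v o="$objects" -v t="$threshold_mins" -v i="$interval_mins" \
  'BEGIN { printf "expected messages per run: %.0f\n", o / (t / i) }'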
The message created by the C&U Coordinator specifies a counter type to retrieve ("realtime" or "hourly"), and an optional time range to collect data for.
6.1.2. Data Collection
The data collection phase of C&U processing is split into two parts: capture, and initial processing and storage, both performed by the C&U Data Collector.
6.1.2.1. Capture
Upon dequeue of a new message the Data Collector makes a connection to Hawkular or Prometheus (depending on the configuration setting for the provider) to retrieve the data for the object and time range specified in the message.
A successful capture is written to evm.log, as follows:
Capture for ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode ⏎
name: [master1.cloud.example.com], ⏎
id: [1000000000019], ⏎
start_time: [2018-01-23 08:04:00 UTC]...Complete ⏎
- Timings: { ⏎
:capture_state=>0.014913082122802734, ⏎
:collect_data=>1.4922049045562744, ⏎
:total_time=>1.8736591339111328}
6.1.2.2. Initial Processing & Storage
The realtime data retrieved from the metrics source is stored in the VMDB metrics table, in one of 24 sub-tables named metrics_00 to metrics_23 (the sub-table is selected from the hour of the sample's timestamp). Dividing the records between sub-tables simplifies some of the data processing tasks. Once the data is stored, the Data Collector queues messages to the Data Processor to perform the hourly, daily and parental rollups.
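As a rough check that realtime samples are arriving in (and being purged from) the hourly sub-tables, the row count of each sub-table can be listed with psql. This is only a sketch; it assumes the default vmdb_production database and is run on the database appliance:
# Count the rows in each of the 24 realtime sub-tables; once purging is
# keeping pace, only the most recent few hours should contain any rows
for h in $(seq -w 0 23); do
  printf "metrics_%s: " "$h"
  psql -U root -d vmdb_production -t -A -c "SELECT count(*) FROM metrics_$h;"
done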
The successful completion of this initial processing stage can be seen in evm.log, as follows:
Processing for ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode ⏎
name: [master1.cloud.example.com], ⏎
id: [1000000000019], ⏎
for range [2018-01-23 08:03:00 UTC - 2018-01-23 09:42:00 UTC]...Complete ⏎
- Timings: { ⏎
:process_counter_values=>0.0027306079864501953, ⏎
:get_attributes=>0.001895904541015625, ⏎
:db_find_prev_perfs=>0.008209705352783203, ⏎
:preload_vim_performance_state_for_ts=>0.0017957687377929688, ⏎
:process_perfs=>0.01605057716369629, ⏎
:process_build_ics=>0.003462076187133789, ⏎
:process_perfs_db=>0.11580920219421387, ⏎
:total_time=>0.16924238204956055}
6.1.3. Data Processing
The C&U Data Processors periodically perform the task of 'rolling up' the realtime data. Rollups are performed hourly and daily, and counters for more granular objects such as containers are aggregated into the counters for their parent objects. For example, for an OpenShift Container Platform provider the parent rollup process includes the following objects:
- Active & recently deleted Pods {hourly,daily} → Project
- Active Pods {hourly,daily} → Replicator
- Active Pods {hourly,daily} → Service
- Node {hourly,daily} → Provider {hourly,daily} → Region {hourly,daily} → Enterprise
Rollup data is stored in the metric_rollups table, in one of 12 sub-tables named metric_rollups_01 to metric_rollups_12 (each sub-table corresponds to a month).
Additional analysis is performed on the hourly rollup data to identify bottlenecks, calculate chargeback metrics, and determine normal operating range and right-size recommendations. The completion of a successful rollup is written to evm.log, as follows:
Rollup for ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode ⏎
name: [master1.cloud.example.com], ⏎
id: [1000000000019] ⏎
for time: [2018-01-23T09:00:00Z]...Complete ⏎
- Timings: { ⏎
:db_find_prev_perf=>0.0028629302978515625, ⏎
:rollup_perfs=>0.018657922744750977, ⏎
:db_update_perf=>0.005361080169677734, ⏎
:process_bottleneck=>0.0017886161804199219, ⏎
:total_time=>0.037772417068481445}
6.2. Data Retention
Capacity and Utilization data is not retained indefinitely in the VMDB. By default hourly and daily rollup data is kept for 6 months after which it is purged, and realtime data samples are purged after 4 hours. These retention periods for C&U data are defined in the :performance section of the Configuration → Advanced settings, as follows:
:performance:
...
:history:
...
:keep_daily_performances: 6.months
:keep_hourly_performances: 6.months
:keep_realtime_performances: 4.hours
6.3. Challenges of Scale
The challenges of scale for capacity & utilization are related to the time constraints involved when collecting and processing the data for several thousand objects in fixed time periods, for example:
- Retrieving realtime counters before they are deleted from the EMS
- Rolling up the realtime counters before the records are purged from the VMDB
- Inter-worker message timeout
When capacity & utilization is not collecting and processing the data consistently, other CloudForms capabilities that depend on the metrics - such as chargeback or rightsizing - become unreliable.
The challenges are addressed by adding concurrency - scaling out both the data collection and processing workers - and by keeping each step in the process as short as possible to maximise throughput.
6.4. Monitoring Capacity & Utilization Performance
As with EMS refresh, C&U data collection has two significant phases that each contribute to the overall performance:
Extracting and parsing the metrics from Hawkular or Prometheus
- Network latency to the Hawkular or Prometheus pod
- Time waiting for Hawkular or Prometheus to process the request and return data
- CPU cycles performing initial processing
Storing the data into the VMDB
- Network latency to the database
- Database appliance CPU, memory and I/O resources
The line printed to evm.log at the completion of each stage of the operation contains detailed timings, and these can be used to determine bottlenecks. The typical log lines for C&U capture and initial processing can be parsed using a script such as perf_process_timings.rb[12], for example:
Capture timings:
capture_state:                          0.018490 seconds
collect_data:                           1.988341 seconds
total_time:                             2.043008 seconds

Process timings:
process_counter_values:                 0.002075 seconds
get_attributes:                         0.000989 seconds
db_find_prev_perfs:                     0.009578 seconds
preload_vim_performance_state_for_ts:   0.002601 seconds
process_perfs:                          0.014115 seconds
process_build_ics:                      0.002263 seconds
process_perfs_db:                       0.049842 seconds
total_time:                             0.120220 seconds
C&U data processing is purely a CPU and database-intensive activity. The rollup timings can be extracted from evm.log in a similar manner, for example:
Rollup timings:
db_find_prev_perf:          0.017392 seconds
rollup_perfs:               0.149081 seconds
db_update_perf:             0.072613 seconds
process_operating_ranges:   0.081201 seconds
total_time:                 0.320613 seconds
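If the perf_process_timings.rb script is not to hand, approximate averages can be pulled straight from evm.log with a one-liner. The sketch below relies only on the "Capture for … Complete", "Processing for … Complete" and "Rollup for … Complete" log lines shown earlier in this chapter, and assumes the default log location of /var/www/miq/vmdb/log/evm.log:
# Average the :total_time value for each C&U phase found in evm.log
for phase in "Capture for" "Processing for" "Rollup for"; do
  grep "$phase .*Complete" /var/www/miq/vmdb/log/evm.log | \
    grep -o ':total_time=>[0-9.]*' | cut -d'>' -f2 | \
    awk -v p="$phase" '{ s += $1; n++ }
      END { if (n) printf "%-15s %d samples, average total_time %.3fs\n", p, n, s/n }'
done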
6.5. Identifying Capacity and Utilization Problems
The detailed information written to evm.log can be used to identify problems with capacity and utilization.
6.5.1. Coordinator
With a very large number of managed objects the C&U Coordinator can become unable to create and queue all of the required perf_capture_realtime messages within its own message timeout period of 600 seconds. When this happens an indeterminate number of managed objects will have no collections scheduled for that time interval. An extract from evm.log that illustrates the problem is as follows:
... INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) ⏎
Message id: [10000221979280], MiqWorker id: [10000001075231], ⏎
Zone: [OCP], Role: [ems_metrics_coordinator], Server: [], ⏎
Ident: [generic], Target id: [], Instance id: [], Task id: [], ⏎
Command: [Metric::Capture.perf_capture_timer], Timeout: [600], ⏎
Priority: [20], State: [dequeue], Deliver On: [], Data: [], ⏎
Args: [], Dequeued in: [2.425676767] seconds
... INFO -- : MIQ(Metric::Capture.perf_capture_timer) Queueing ⏎
performance capture...
... INFO -- : MIQ(MiqQueue.put) Message id: [10000221979391], ⏎
id: [], Zone: [OCP], Role: [ems_metrics_collector], Server: [], ⏎
Ident: [openshift_enterprise], Target id: [], ⏎
Instance id: [10000000000113], Task id: [], ⏎
Command: [ManageIQ::Providers::Kubernetes::ContainerManager:: ⏎
ContainerNode.perf_capture_realtime], Timeout: [600], ⏎
Priority: [100], State: [ready], Deliver On: [], Data: [], ⏎
Args: [2017-03-23 20:59:00 UTC, 2017-03-24 18:33:23 UTC]
...
... INFO -- : MIQ(MiqQueue.put) Message id: [10000221990773], ⏎
id: [], Zone: [OCP], Role: [ems_metrics_collector], Server: [], ⏎
Ident: [openshift_enterprise], Target id: [], ⏎
Instance id: [10000000032703], Task id: [], ⏎
Command: [ManageIQ::Providers::Kubernetes::ContainerManager:: ⏎
ContainerGroup.perf_capture_realtime], Timeout: [600], ⏎
Priority: [100], State: [ready], Deliver On: [], Data: [], ⏎
Args: [2017-03-24 18:10:20 UTC, 2017-03-24 18:43:15 UTC]
... ERROR -- : MIQ(MiqQueue#deliver) Message id: [10000221979280], ⏎
timed out after 600.002976954 seconds. Timeout threshold [600]
Such problems can be detected by looking for message timeouts in the log using a command such as the following:
egrep "Message id: \[\d+\], timed out after" evm.log
Any lines matched by this search can be traced back, using the PID field in the log line, to determine the operation that was in progress when the message timeout occurred.
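One way of doing this is to pull the process ID out of the log line prefix and then search for other lines written by the same PID around the time of the timeout. This is a sketch only: it assumes the standard evm.log prefix in which the PID follows the '#' character, and 12345 is a placeholder PID:
# Find the timeouts (the PID appears in the prefix of each matched line)
grep "timed out after" /var/www/miq/vmdb/log/evm.log
# ...then review everything that the worker with that PID was doing
grep " #12345:" /var/www/miq/vmdb/log/evm.log | less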
6.5.2. Data Collection
There are several types of log line written to evm.log that can indicate C&U data collection problems.
6.5.2.1. Messages Still Queued from Last C&U Coordinator Run
Before the C&U Coordinator starts queueing new messages it calls an internal method, perf_capture_health_check, which prints the number of capture messages still queued from previous C&U Coordinator schedules. If the C&U Data Collectors are keeping pace with the rate at which messages are added, there should be few or no messages remaining in the queue when the C&U Coordinator runs. If the C&U Data Collectors are not dequeuing and processing messages quickly enough, a backlog will build up.
Searching for the string "perf_capture_health_check" on the CFME appliance with the active C&U Coordinator role will show the state of the queue before the C&U Coordinator adds further messages, and any backlog will be visible.
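For example, assuming the default log location on the appliance (the last three lines cover the "realtime", "hourly" and "historical" capture types from the most recent run):
grep "perf_capture_health_check" /var/www/miq/vmdb/log/evm.log | tail -3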
... INFO -- : MIQ(Metric::Capture.perf_capture_health_check) ⏎
520 "realtime" captures on the queue for zone [OCP Zone] - ⏎
oldest: [2016-12-13T07:14:44Z], recent: [2016-12-13T08:02:32Z]
... INFO -- : MIQ(Metric::Capture.perf_capture_health_check) ⏎
77 "hourly" captures on the queue for zone [OCP Zone] - ⏎
oldest: [2016-12-13T08:02:15Z], recent: [2016-12-13T08:02:17Z]
... INFO -- : MIQ(Metric::Capture.perf_capture_health_check) ⏎
0 "historical" captures on the queue for zone [OCP Zone]
6.5.2.2. Long Dequeue Times
Searching for the string "MetricsCollectorWorker::Runner#get_message_via_drb" will show the log lines printed when the C&U Data Collector messages are dequeued. Long dequeue times (over 600 seconds) indicate that the number of C&U Data Collectors should be increased.
... MIQ(ManageIQ::Providers::Openshift::ContainerManager:: ⏎
MetricsCollectorWorker::Runner#get_message_via_drb) ⏎
Message id: [1476318], MiqWorker id: [2165], Zone: [default], ⏎
Role: [ems_metrics_collector], Server: [], Ident: [openshift], ⏎
Target id: [], Instance id: [1475], Task id: [], ⏎
Command: [ManageIQ::Providers::Kubernetes::ContainerManager:: ⏎
Container.perf_capture_realtime], Timeout: [600], Priority: [100], ⏎
State: [dequeue], Deliver On: [], Data: [], Args: [], ⏎
Dequeued in: [1576.125461466] seconds
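A quick way of spotting problem messages is to extract the "Dequeued in" values from these lines and print anything above the 600 second message timeout. This is a sketch based on the log format shown above:
# Print any collector dequeue times longer than 600 seconds
grep "MetricsCollectorWorker::Runner#get_message_via_drb" /var/www/miq/vmdb/log/evm.log | \
  grep -o 'Dequeued in: \[[0-9.]*\]' | tr -d '[]' | \
  awk '$3 + 0 > 600 { print $3 " seconds" }'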
6.5.2.3. Missing Data Samples
Searching for the string "expected to get data" can reveal whether requested data sample points were not available for retrieval from Hawkular, as follows:
... WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager:: ⏎
ContainerGroup#perf_capture) [realtime] For ⏎
ManageIQ::Providers::Kubernetes::ContainerManager::ContainerGroup ⏎
name: [jenkins-1], id: [2174], start_time: [2017-08-11 00:00:00 UTC], ⏎
expected to get data as of [2017-08-11T00:00:20Z], but got data as ⏎
of [2017-08-11 14:28:20 UTC]
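To see which objects are most affected, the warning lines can be grouped by object name. This is a minimal sketch based on the message format above:
# Count "expected to get data" warnings per object name
grep "expected to get data" /var/www/miq/vmdb/log/evm.log | \
  grep -o 'name: \[[^]]*\]' | sort | uniq -c | sort -rn | head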
6.5.3. Data Processing
The rollup and associated bottleneck and performance processing of the C&U data is less time sensitive, although it must still be completed within the 4 hour realtime performance data retention period.
With a very large number of managed objects and insufficient worker processes, the time taken to process the realtime data can exceed the 4 hour retention period, meaning that data is lost. Similarly, the time taken to process the hourly rollups can exceed an hour, in which case the rollup process never catches up with the rate at which messages are added.
The count of messages queued for processing by the Data Processor can be extracted from evm.log, as follows:
grep 'count for state=\["ready"\]' evm.log | ⏎
egrep -o "\"ems_metrics_processor\"=>[[:digit:]]+"

"ems_metrics_processor"=>16612
"ems_metrics_processor"=>16494
"ems_metrics_processor"=>12073
"ems_metrics_processor"=>12448
"ems_metrics_processor"=>13015
...
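To see whether the backlog is growing or shrinking over time, each count can be paired with the timestamp from its line prefix. The sed expression below assumes the usual evm.log prefix containing an ISO-8601 timestamp; adjust the pattern if the prefix differs:
# Print "<timestamp to the minute>  <queued ems_metrics_processor messages>"
grep 'count for state=\["ready"\]' /var/www/miq/vmdb/log/evm.log | \
  sed -n 's/^.*\[\([0-9-]*T[0-9]*:[0-9]*\).*"ems_metrics_processor"=>\([0-9]*\).*$/\1  \2/p'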
The "Dequeued in" and "Delivered in" times for messages processed by the MiqEmsMetricsProcessorWorkers can be used as guidelines for overall throughput, for example:
... MIQ(MiqEmsMetricsProcessorWorker::Runner#get_message_via_drb) ⏎
Message id: [1000003602714], MiqWorker id: [1000000003176], Zone: [default], ⏎
Role: [ems_metrics_processor], Server: [], Ident: [ems_metrics_processor], ⏎
Target id: [], Instance id: [1000000000141], Task id: [], ⏎
Command: [ContainerService.perf_rollup], Timeout: [1800], Priority: [100], ⏎
State: [dequeue], Deliver On: [2018-02-15 10:00:00 UTC], Data: [], ⏎
Args: ["2018-02-15T09:00:00Z", "hourly"], Dequeued in: [1.982437188] seconds
... INFO -- : MIQ(MiqQueue#delivered) Message id: [1000003602714], ⏎
State: [ok], Delivered in [0.042801554] seconds
When C&U is operating correctly, for each time-profile instance there should be one daily record and at least 24 hourly records for each active OpenShift Container Platform entity such as a service, pod or container. There should also be no more than 5 of the metrics_## sub-tables containing any records, since realtime data is purged after 4 hours.
The following SQL query can be used to confirm that the records are being processed correctly:
select resource_id, date_trunc('day',timestamp) as collect_date, ⏎
resource_type, capture_interval_name, count(*)
from metric_rollups
where resource_type like '%Container%'
group by resource_id, collect_date, resource_type, capture_interval_name
order by resource_id, collect_date, resource_type, capture_interval_name, count
;
._id | collect_date | resource_type | capture_int... | count
-----+---------------------+---------------------+----------------+-------
...
89 | 2018-01-26 00:00:00 | ContainerService | daily | 1
89 | 2018-01-26 00:00:00 | ContainerService | hourly | 24
89 | 2018-01-27 00:00:00 | ContainerReplicator | daily | 1
89 | 2018-01-27 00:00:00 | ContainerReplicator | hourly | 24
...
2050 | 2018-02-01 00:00:00 | ContainerGroup | daily | 1
2050 | 2018-02-01 00:00:00 | ContainerGroup | hourly | 24
2050 | 2018-02-02 00:00:00 | ContainerGroup | daily | 1
2050 | 2018-02-02 00:00:00 | ContainerGroup | hourly | 24
...
2332 | 2018-01-29 00:00:00 | Container | daily | 1
2332 | 2018-01-29 00:00:00 | Container | hourly | 24
2332 | 2018-01-30 00:00:00 | Container | daily | 1
2332 | 2018-01-30 00:00:00 | Container | hourly | 24
...
6.6. Recovering From Capacity and Utilization Problems
If C&U realtime data is not collected it is generally lost. Some historical information is retrievable using C&U gap collection (see Figure 6.1, “C&U Gap Collection”), but this is of a lower granularity than the realtime metrics that are usually collected. Although gap collection is intended for use with VMware providers, it also works in a more limited capacity with the OpenShift Container Platform provider.
Figure 6.1. C&U Gap Collection

6.7. Tuning Capacity and Utilization
Tuning capacity and utilization generally involves ensuring that the VMDB is running optimally, and adding workers and CFME appliances to scale out the processing capability.
6.7.1. Scheduling
Messages for the ems_metrics_coordinator (C&U coordinator) server role are processed by a Generic or Priority worker. These workers also process automation messages, which are often long-running. For larger CloudForms installations it can be beneficial to separate the C&U Coordinator and Automation Engine server roles onto different CFME appliances.
6.7.2. Data Collection
The metrics_00 to metrics_23 VMDB tables have a high rate of insertions and deletions, and benefit from regular reindexing. The database maintenance scripts run a /usr/bin/hourly_reindex_metrics_tables script that reindexes one of the tables every hour.
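If index bloat is suspected on a particular sub-table, a one-off reindex can also be run manually from psql on the database appliance. The default vmdb_production database name is assumed here:
# Reindex a single realtime metrics sub-table (metrics_08 chosen as an example)
psql -U root -d vmdb_production -c "REINDEX TABLE metrics_08;"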
6.7.2.1. Increasing the Number of Data Collectors
Ideally each new batch of ems_metrics_collector messages should be fully processed in the 3 minute time window before the next batch of messages is created. If the ems_metrics_collector message queue length were plotted against time, it should look like a sawtooth, with a sudden rise in the number of queued messages every 3 minutes followed by a gradual decline to zero as the data collectors dequeue and process the messages (see Figure 6.2, “ems_metrics_collector Message Queue Length Stable over Time”).
Figure 6.2. ems_metrics_collector Message Queue Length Stable over Time

If the C&U Data Collector workers can’t keep up with the rate at which messages are added to the queue, the queue length will rise indefinitely (see Figure 6.3, “ems_metrics_collector Message Queue Length Increasing over Time”).
Figure 6.3. ems_metrics_collector Message Queue Length Increasing over Time

If the ems_metrics_collector message queue length is steadily increasing, the number of C&U Data Collector workers should be increased. The default number of workers per appliance or pod is 2, but this can be increased up to a maximum of 9. Consideration should be given to the additional CPU and memory requirements that an increased number of workers will place on the appliance or pod. It may be more appropriate to add further appliances or cloudforms-backend StatefulSet replicas and scale horizontally.
For larger CloudForms installations it can be beneficial to separate the C&U Data Collector and Automation Engine server roles onto different CFME appliances, as both are resource intensive. Very large CloudForms installations (managing many thousands of objects) may benefit from dedicated CFME appliances in the provider zones exclusively running the C&U data collector role.
6.7.3. Data Processing
If C&U data processing is taking too long to process the rollups for all objects, the number of C&U Data Processor workers can be increased from the default of 2 up to a maximum of 4 per appliance. As before, consideration should be given to the additional CPU and memory requirements that an increased number of workers will place on an appliance. Adding further CFME appliances or "cloudforms-backend" StatefulSet replicas to the zone may be more appropriate.
For larger CloudForms installations it can be beneficial to separate the C&U Data Processor and Automation Engine server roles onto different CFME appliances or pods, as both are resource intensive. CloudForms installations managing several thousand objects may benefit from dedicated CFME appliances or "cloudforms-backend" StatefulSet replicas in the provider zones exclusively running the C&U Data Processor role.
