Chapter 2. Architecture

In order to understand how to deploy CloudForms at scale, it is important to understand the architectural components that affect the design and deployment decisions. These principal components are described in this chapter.

2.1. Appliances

To simplify installation, the Red Hat CloudForms product is distributed as a self-contained virtual machine template, which when cloned becomes a CloudForms Management Engine (CFME) appliance. Each release of the Red Hat CloudForms product since v2.0 has had a corresponding CloudForms Management Engine release, although the version numbers are not the same (for historical reasons). The following table summarizes the relative CFME and CloudForms product versions.

Table 2.1. Summary of the relative CFME and CloudForms product versions

CloudForms Management Engine versionCloudForms (Product) version

5.1

2.0

5.2

3.0

5.3

3.1

5.4

3.2

5.5

4.0

5.6

4.1

5.7

4.2

5.8

4.5

A CFME 5.8 (CloudForms 4.5) appliance runs Red Hat Enterprise Linux 7.3, with PostgreSQL 9.5, Rails 5.0.2, the CloudForms evmserverd service, and all associated Ruby gems installed. A new addition with CFME 5.8 is the Embedded Ansible 3.1 automation manager, also packaged with the appliance.

The self-contained nature of appliances makes them ideal for horizontally scaling a CloudForms deployment to handle the increased load that larger clouds or virtual infrastructures present.

Appliances are downloadable as images or templates in formats suitable for VMware, Red Hat Virtualization, OpenStack, Amazon EC2, Microsoft’s System Center Virtual Machine Manager or Azure cloud, and Google Compute Engine. The most recent versions can also be installed as an OpenShift 3.x container image (although this platform is a technology preview).

2.2. Database

A CloudForms region stores all of its data in a PostgreSQL database. This is known as the Virtual Management Database or VMDB, although the terms "database" and "VMDB" are often used interchangeably. The database can be internal and integral with an appliance running several other roles (typical for smaller CloudForms deployments), but for larger CloudForms deployments it is typically a dedicated database server or cluster configured for high availability and disaster recovery.

2.3. Application

CloudForms is a Ruby on Rails application. The main miq_server.rb Rails application is supported by a number of worker processes that perform the various interactions with managed systems, or collect and analyse data.

2.4. Providers

CloudForms manages each cloud, container or virtual environment using modular subcomponents called providers. Each provider contains the classes and modules required to connect to and manage its specific target platform, and this provider specialization enables common functionality to be abstracted by provider type or class. CloudForms acts as a "manager of managers", and in keeping with this concept, providers communicate with their respective underlying cloud or infrastructure platform manager (such as vCenter server or RHV-M) using the native APIs published for the platform manager. A provider’s platform manager is referred to as an External Management System or EMS.

Note

Although the terms provider and external management system (EMS) are often used interchangeably, there is an important distinction. The provider is the CloudForms component, whereas the EMS is the managed entity that the provider connects to, such as the VMware vCenter

Providers are broadly divided into categories, and in CloudForms 4.5 these are Cloud, Infrastructure, Container, Configuration Management, Automation, Network, Middleware and Storage.[1]

2.4.1. Provider Namespaces

Many provider components are named according to a name-spacing schema that follows the style of:

ManageIQ::Providers::<ProviderName>::<ProviderCategory>

Some examples of this are as follows:

  • ManageIQ::Providers::EmbeddedAnsible::AutomationManager
  • ManageIQ::Providers::OpenshiftEnterprise::ContainerManager
  • ManageIQ::Providers::Openstack::CloudManager
  • ManageIQ::Providers::Openstack::InfraManager
  • ManageIQ::Providers::Azure::NetworkManager
  • ManageIQ::Providers::StorageManager::CinderManager
  • ManageIQ::Providers::Vmware::InfraManager

2.5. Server Roles

A CloudForms Management Engine 5.8 appliance can be configured to run up to 19 different server roles. These are enabled or disabled in the server Configuration section of the WebUI (see Figure 2.1, “Server Roles”).

Figure 2.1. Server Roles

Screenshot


Server roles are implemented by worker processes (see Section 2.6, “Workers”), many of which receive work instructions from messages (see Section 2.7, “Messages”).

2.5.1. Automation Engine

The Automation Engine role enables a CFME appliance to process queued automation tasks.[2]. There should be at least one CFME appliance with this role set in each zone. The role does not have a dedicated worker, automate tasks are processed by either a MiqGenericWorker or a MiqPriorityWorker, depending on message priority.

Note

The Automation Engine also handles the processing of events through the automate event switchboard

2.5.2. Capacity and Utilization

Capacity and utilization (C&U) metrics processing is a relatively resource-intensive operation, and there are three roles associated with its operation.

  • The Capacity & Utilization Coordinator role acts as a scheduler for the collection of C&U data in a zone, and queues work for the Capacity and Utilization Data Collector. If more than one CFME appliance in a zone has this role enabled, only one will be active at a time. This role does not have a dedicated worker, the C&U Coordinator tasks are processed by either a MiqGenericWorker or a MiqPriorityWorker, depending on message priority.
  • The Capacity & Utilization Data Collector performs the actual collection of C&U data. This role has a dedicated worker, and there is no limit to the number of concurrent workers in a zone. Enabling this role starts the provider-specific data collector workers for any providers in the appliance’s zone. For example a CFME appliance in a zone configured with a Red Hat Virtualization provider would contain one or more ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker processes if the C&U Data Collector server role was enabled.
  • The Capacity & Utilization Data Processor processes all of the data collected, allowing CloudForms to create charts, display utilization statistics, etc.. This role has a dedicated worker called the MiqEmsMetricsProcessorWorker, and there is no limit to the number of concurrent workers in a zone.
Note

The Capacity & Utilization roles are described in more detail in Chapter 6, Capacity & Utilization

2.5.3. Database Operations

The Database Operations role enables a CFME appliance to run certain database maintenance tasks such as purging old metrics. This role does not have a dedicated worker, the database operations tasks are processed by a MiqGenericWorker.

2.5.4. Embedded Ansible

The Embedded Ansible role enables the use of the built-in Ansible automation manager, which allows Ansible playbooks to be run from service catalogs, or from control actions and alerts. If more than one CFME appliance in a region has this role enabled, only one will be active at a time. This role has a dedicated worker called the EmbeddedAnsibleWorker, but enabling the role also starts the following event catcher and refresh workers:

  • ManageIQ::Providers::EmbeddedAnsible::AutomationManager::EventCatcher
  • ManageIQ::Providers::EmbeddedAnsible::AutomationManager::RefreshWorker
Note

Enabling the Embedded Ansible role adds approximately 2GBytes to the memory requirements of a CFME appliance

2.5.5. Event Monitor

The Event Monitor role is responsible for detecting and processing provider events such as a VM starting or stopping, a cloud instance being created, or a hypervisor rebooting. Enabling the role starts at least 2 workers; one or more provider-specific, and one common event handler.

The provider-specific event catcher maintains a connection to a provider’s event source (such as the Google Cloud Pub/Sub API for Google Compute Engine) and detects or 'catches' events and passes them to the common event handler. An event catcher worker is started for each provider in the appliance’s zone; a zone containing a VMware provider would contain a ManageIQ::Providers::Vmware::InfraManager::EventCatcher worker, for example.

Some cloud providers automatically add several types of manager, and these might each have an event catcher worker. To illustrate this, enabling the event monitor role on an appliance in an OpenStack Cloud provider zone would start the following event catcher workers:

  • ManageIQ::Providers::Openstack::CloudManager::EventCatcher
  • ManageIQ::Providers::Openstack::NetworkManager::EventCatcher
  • ManageIQ::Providers::StorageManager::CinderManager::EventCatcher

The event handler worker, called MiqEventHandler, is responsible for feeding the events from all event catchers in the zone into the automation engine’s event switchboard for processing.

There should be at least one CFME appliance with the event monitor role set in any zone containing a provider, however if more than one CFME appliance in a zone has this role, only one will be active at a time.

Note

The event catcher and event handler workers are described in more detail in Chapter 9, Event Handling

2.5.6. Git Repositories Owner

A CFME appliance with the Git Repositories Owner role enabled is responsible for synchronising git repository data from a git source such as Github or Gitlab, and making it available to other appliances in the region that have the automation engine role set. The git repository data is copied to /var/www/miq/vmdb/data/git_repos/<git_profile_name>/<git_repo_name> on the CFME appliance. This role does not have a dedicated worker.

2.5.7. Notifier

The Notifier role should be enabled if CloudForms is required to forward SNMP traps to a monitoring system, or to send e-mails. These might be initiated by an automate method or from a control policy, for example.

If more than one CFME appliance in a region has this role enabled, only one will be active at a time. This role does not have a dedicated worker, notifications are processed by either a MiqGenericWorker or a MiqPriorityWorker, depending on message priority.

2.5.8. Provider Inventory

The Provider Inventory role is responsible for refreshing provider inventory data for all provider objects such as virtual machines, hosts, clusters, tenants, or orchestration templates. It is also responsible for capturing datastore file lists. If more than one CFME appliance in a zone has this role enabled, only one will be active at a time.

Setting this role starts the provider-specific refresh workers for any providers in the appliance’s zone; a zone containing a RHV provider would contain a ManageIQ::Providers::Redhat::InfraManager::RefreshWorker worker, for example.

VMware providers add an additional MiqEmsRefreshCoreWorker, while cloud providers that use several types of manager add a worker per manager. For example enabling the Provider Inventory role on an appliance in an Azure provider zone would start the following Refresh workers:

  • ManageIQ::Providers::Azure::CloudManager::RefreshWorker
  • ManageIQ::Providers::Azure::NetworkManager::RefreshWorker
Note

Provider Inventory refresh workers are described in more detail in Chapter 5, Inventory Refresh

2.5.9. Provider Operations

A CFME appliance with the Provider Operations role performs certain managed object operations such as stop, start, suspend, shutdown guest, clone, reconfigure, etc., to provider objects such as VMs. These operations might be initiated from the WebUI, from Automate, or from a REST call. It also handles some storage-specific operations such as creating cloud volume snapshots. The role does not have a dedicated worker, provider operations tasks are processed by either a MiqGenericWorker or a MiqPriorityWorker, depending on message priority. There is no limit to the number of concurrent workers handling this role in a zone.

Note

The Provider Operations role is often required in zones that don’t necessarily contain providers.

For example, enabling the Provider Operations role in a WebUI zone can improve performance by reducing the number of individual EMS connections required for user-initiated VM operations, in favour of a single brokered connection. The Provider Operations role is also required in any zone that may run service-initiated VM provisioning operations.

2.5.10. RHN Mirror

A CFME appliance with the RHN Mirror role acts as a repository server for the latest CloudForms Management Engine RPM packages. It also configures other CFME appliances within the same region to point to itself for updates. This provides a low bandwidth method to update environments with multiple appliances. The role does not have a dedicated worker.

2.5.11. Reporting

The Reporting role allows a CFME appliance to generate reports. There should be at least one CFME appliance with this role in any zone in which reports are automatically scheduled or manually requested/queued.[3] (such as from a WebUI zone).

Enabling this server role starts one or more MiqReportingWorker workers.

2.5.12. Scheduler

The Scheduler sends messages to start all scheduled activities such as report generation, database backups, or to retire VMs or services. One server in each region must be assigned this role or scheduled CloudForms events will not occur. Enabling this server role starts the MiqScheduleWorker worker.

Note

Each CFME appliance also has a schedule worker running but this only handles local appliance task scheduling.

The Scheduler role is for region-specific scheduling and is only active on one appliance per region.

2.5.13. SmartProxy

Enabling the SmartProxy role turns on the embedded SmartProxy on the CFME appliance. The embedded SmartProxy can analyse virtual machines that are registered to a host and templates that are associated with a provider. Enabling this role starts three MiqSmartProxyWorker workers.

2.5.14. SmartState Analysis

The SmartState Analysis role controls which CFME appliances can control SmartState Analyses and process the data from the analysis. There should be at least one of these in each zone that contains a provider. This role does not have a dedicated worker, SmartState tasks are processed by either a MiqGenericWorker or a MiqPriorityWorker, depending on message priority.

Note

The SmartProxy and SmartState Analysis roles are described in more detail in Chapter 10, SmartState Analysis

2.5.15. User Interface

This role enables access to a CFME appliance using the Red Hat CloudForms Operations WebUI console. More than one CFME appliance can have this role in a zone (the default behaviour is to have this role enabled on all appliances). Enabling this server role starts one or more MiqUiWorker workers.

Note

The use of multiple WebUI appliances in conjunction with load balancers is described in more detail in Chapter 11, Web User Interface

2.5.16. Web Services

This role enables the RESTful Web service API on a CFME appliance. More than one CFME appliance can have this role in a zone. Enabling this server role starts one or more MiqWebServiceWorker workers.

Note

The Web Services role is required by the Self-Service User Interface (SSUI). Both the User Interface and Web Services roles must be enabled on a CFME appliance to enable logins to the Operations WebUI

2.5.17. Websocket

This role enables a CFME appliance to be used as a websocket proxy for the VNC and SPICE HTML5 remote access consoles. It is also used by the WebUI notification service. Enabling this server role starts one or more MiqWebsocketWorker workers.

2.5.18. Server Role Zone Affinity

Many server roles - or more accurately their worker processes - have an affinity to the zone with which the hosting CFME appliance is associated. For example messages intended for zone "A" will generally not be processed by worker processes in zone "B".

The following server roles have zone affinity:

  • C&U Metrics Coordinator
  • C&U Metrics Collector
  • C&U Metrics Processor
  • Event Monitor
  • Git Repositories Owner
  • Provider Inventory
  • Provider Operations
  • SmartProxy
  • SmartState Analysis
Note

Some server roles such as Automation Engine have optional zone affinity. If an automate message specifies the zone to be run in, the task will only be processed in that zone. If an automate message doesn’t specify the zone, the task can run anywhere.

2.6. Workers

As can be seen, many of the server roles start worker processes. The currently running worker processes on a CFME appliance can be viewed using the following commands in a root bash shell on an appliance:

vmdb
bin/rake evm:status

The same information can also be seen in the Workers tab of the Configuration → Diagnostics page (see Figure 2.2, “Worker Processes”).

Figure 2.2. Worker Processes

Screenshot


Note

CFME 5.8 has provided a new command that allows the currently running worker processes on the local server and remote servers can be seen, ordered by server and zone:

vmdb
bin/rake evm:status_full

In addition to the workers started by enabling a server role, each appliance has by default four workers that handle more generic tasks: two MiqGenericWorkers and two MiqPriorityWorkers. The MiqPriorityWorkers handle the processing of the highest priority messages (priority 20) in the generic message queue (see Section 2.7, “Messages”).

Generic and Priority workers process tasks for the following server roles:

  • Automate
  • C&U Coordinator
  • Database Operations
  • Notifier
  • Provider Operations
  • SmartState Analysis

2.6.1. Worker Validation

Monitoring the health status of workers becomes important as a CloudForms installation is scaled. A server thread called validate_worker checks that workers are alive (they have recently issued a 'heartbeat' ping.[4]), and are within their time limits and memory thresholds. Some workers such as Refresh and SmartProxy workers have a maximum lifetime of 2 hours to restrict their resource consumption.[5]. If this time limit is exceeded, the validate_worker thread will instruct the worker to exit at the end of its current message processing, and spawn a new replacement.

The following evm.log line shows an example of the normal timeout processing for a RefreshWorker:

INFO -- : MIQ(MiqServer#validate_worker) Worker ⏎
[ManageIQ::Providers::Vmware::InfraManager::RefreshWorker] ⏎
with ID: [1000000258651], PID: [17949], ⏎
GUID: [77362eba-c179-11e6-aaa4-00505695be62] uptime has reached ⏎
the interval of 7200 seconds, requesting worker to exit

The following log line shows an example of an abnormal exit request for a MiqEmsMetricsProcessorWorker that has exceeded its memory threshold (see Section 2.6.2.1, “Worker Memory Thresholds”:

WARN -- : MIQ(MiqServer#validate_worker) Worker [MiqEmsMetricsProcessorWorker] ⏎
with ID: [1000000259290], PID: [15553], ⏎
GUID: [40698326-c18a-11e6-aaa4-00505695be62] process memory usage [598032000] ⏎
exceeded limit [419430400], requesting worker to exit
Tip

The actions of validate_worker can be examined in evm.log by using the following command:

grep 'MiqServer#validate_worker' evm.log

Use this command to check for workers exceeding their memory allocation.

2.6.2. Tuning Workers

It is often a requirement to tune the number of per-appliance workers and their memory thresholds when CloudForms is deployed to manage larger clouds or virtual infrastructures.

2.6.2.1. Worker Memory Thresholds

Each worker type is given an out-of-the-box initial memory threshold. The default values have been chosen to perform well with an 'average' workload, but these sometimes need to be increased, depending on the runtime requirements of the specific CloudForms installation.

2.6.2.2. Adjusting Worker Settings

The count and maximum memory thresholds for most worker types can be tuned from the CloudForms WebUI, in the Workers tab of the Configuration → Settings page for each appliance (see Figure 2.3, “Worker Settings”).

Figure 2.3. Worker Settings

Screenshot


For other workers not listed in this page, the memory threshold settings can be tuned (with caution) in the Configuration → Advanced settings by directly editing the YAML, for example:

:workers:
  :worker_base:
  ...
    :ui_worker:
      :connection_pool_size: 8
      :memory_threshold: 1.gigabytes
      :nice_delta: 1
      :count: 1

2.6.3. Worker Task Allocation

Tasks are dispatched to the various workers in one of three ways:

  1. From a scheduled timer. Some tasks are completely synchronous and predictable, and these are dispatched from a timer. The Schedule worker executes in this way.
  2. From an asynchronous event. Some tasks are asynchronous but require immediate handling to maintain overall system responsiveness, or to ensure that data is not lost. The following workers poll or listen for such events:

    • Event Catcher workers
    • WebUI workers
    • Web Services (REST API) workers
    • Web Socket workers
  3. From a message. Asynchronous tasks that are not time-critical are dispatched to workers using a message queue. The following list shows "queue workers" that receive work from queued messages:

    • Generic workers
    • Priority workers
    • Metrics Collector workers
    • Metrics Processor workers
    • Refresh workers
    • Event Handler workers
    • SmartProxy workers
    • Reporting workers

Many of the queued messages are created by workers dispatching work to other workers. For example, the Schedule worker will queue a message for the SmartProxy workers to initiate a SmartState Analysis. An Event Catcher worker will queue a message for an Event Handler worker to process the event. This will in turn queue a message for a Priority worker to process the event through the automate event switchboard.

Tip

Queue workers process messages in a serial fashion. A worker processes one and only one message at a time.

2.7. Messages

The queue workers receive work instructions from messages, delivered via a VMDB table called miq_queue, and modelled by the Rails class MiqQueue. Each queue worker queries the miq_queue table to look for work for any of its roles. If a message is claimed by a worker, the message state is changed from "ready" to "dequeue" and the worker starts processing the message. When the message processing has completed the message state is updated to indicate "ok", "error" or "timeout". Messages that have completed processing are purged on a regular basis.

2.7.1. Message Prefetch

To improve the performance of the messaging system, each CFME appliance prefetches a batch of messages into its local memcache. When a worker looks for work by searching for a "ready" state message, it calls an MiqQueue method get_message_via_drb that transparently searches the prefetched message copies in the memcache. If a suitable message is found, the message’s state in the VMDB miq_queue table is changed to "dequeue", and the message is processed by the worker.

2.7.2. Message Fields

A message contains a number of fields. The useful ones to be aware of for troubleshooting purposes are described below.

2.7.2.1. Ident

Each message has an Ident field that specifies the worker type that the message is intended for. Messages with an Ident field of 'generic' can be processed by either MiqGenericWorkers or MiqPriorityWorkers, depending on message priority.

2.7.2.2. Role

The message also has a Role field that specifies the server role that the message is intended for. Some workers - the Generic and Priority workers for example - process the messages for several server roles such as Automation Engine or Provider Operations. Workers are aware of the active server roles on their CFME appliance, and only dequeue messages for the enabled server roles.

2.7.2.3. Priority

Messages each have a Priority field such that lower priority messages for the same worker role are processed before higher priority messages (1 = highest, 200 = lowest). For example, priority 90 messages are processed before priority 100 messages regardless of the order in which they were created. The default message priority is 100, but tasks that are considered of greater importance are queued using messages with lower priority numbers. These message priorities are generally hard-coded and not customizable.

2.7.2.4. Zone

Each message has a Zone field that specifies the zone that the receiving worker should be a member of in order to dequeue the message. Some messages are created with the zone field empty, which means that the message can be dequeued and processed by the Ident worker type in any zone.

2.7.2.5. Server

Messages have a Server field, which is only used if the message is intended to be processed by a particular CFME appliance. If used, the field specifies the GUID of the target CFME appliance.

2.7.2.6. Timeout

Each message has a Timeout field. If the dispatching worker has not completed the message task in the time specified by the timeout, the worker will be terminated and a new worker spawned in its place.

2.7.2.7. State

The messages have a State field that describes the current processing status of the message (see below).

2.7.3. Tracing Messages in evm.log

Message processing is so critical to the overall performance of a CloudForms installation, that understanding how to follow messages in evm.log is an important skill to master when scaling CloudForms. There are generally four stages of message processing that can be followed in the log file. For this example a message will be traced that instructs the Automation Engine (role "automate" in queue "generic") to run the method AutomationTask.execute on automation task ID 7829.

2.7.3.1. Stage 1 - Adding a message to the queue.

A worker (or other Rails process) adds a message to the queue by calling MiqQueue.put, passing all associated arguments that the receiving worker needs to process the task. For this example the message should be processed in zone 'RHV', and has a timeout of 600 seconds (automation tasks typically have a 10 minute time period in which to run). The message priority is 100, indicating that a Generic worker rather than Priority worker should process the message (both workers monitor the "generic" queue). The line from evm.log is as follows:

... INFO -- : Q-task_id([automation_request_6298]) MIQ(MiqQueue.put) ⏎
Message id: [32425368], ⏎
id: [], ⏎
Zone: [RHV], ⏎
Role: [automate], ⏎
Server: [], ⏎
Ident: [generic], ⏎
Target id: [], ⏎
Instance id: [7829], ⏎
Task id: [automation_task_7829], ⏎
Command: [AutomationTask.execute], ⏎
Timeout: [600], ⏎
Priority: [100], ⏎
State: [ready], ⏎
Deliver On: [], ⏎
Data: [], ⏎
Args: []

2.7.3.2. Stage 2 - Retrieving a message from the queue.

A Generic worker calls get_message_via_drb to dequeue the next available message. This method searches the prefetched message queue in the memcache for the next available message with a state of "ready". The new message with ID 32425368 is found, so its state is changed to "dequeue" in the VMDB miq_queue table, and the message is dispatched to the worker. The line from evm.log is as follows:

... INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) ⏎
Message id: [32425368], ⏎
MiqWorker id: [260305], ⏎
Zone: [RHV], ⏎
Role: [automate], ⏎
Server: [], ⏎
Ident: [generic], ⏎
Target id: [], ⏎
Instance id: [7829], ⏎
Task id: [automation_task_7829], ⏎
Command: [AutomationTask.execute], ⏎
Timeout: [600], ⏎
Priority: [100], ⏎
State: [dequeue], ⏎
Deliver On: [], ⏎
Data: [], ⏎
Args: [], ⏎
Dequeued in: [6.698342458] seconds
Tip

The "Dequeued in" value is particularly useful to monitor when scaling CloudForms as this shows the length of time that the message was in the queue before being processed. Although most messages are dequeued within a small number of seconds, a large value does not necessarily indicate a problem. Some messages are queued with a 'Deliver On' time which may be many minutes or hours in the future. The message will not be dequeued until the 'Deliver On' time has expired.

An example of this can be seen in the message to schedule a C&U hourly rollup, as follows:

... State: [dequeue], Deliver On: [2017-04-27 09:00:00 UTC], ⏎
Data: [], Args: ["2017-04-27T08:00:00Z", "hourly"], ⏎
Dequeued in: [2430.509191336] seconds

2.7.3.3. Stage 3 - Delivering the message to the worker.

The MiqQueue class’s deliver method writes to evm.log to indicate that the message is being delivered to a worker, and starts the timeout clock for its processing. The line from evm.log is as follows:

... INFO -- : Q-task_id([automation_task_7829]) ⏎
MIQ(MiqQueue#deliver) Message id: [32425368], Delivering...

2.7.3.4. Stage 4 - Message delivered and work is complete.

Once the worker has finished processing the task associated with the message, the MiqQueue class’s delivered method writes to evm.log to indicate that message processing is complete. The line from evm.log is as follows:

... INFO -- : Q-task_id([automation_task_7829]) ⏎
MIQ(MiqQueue#delivered) ⏎
Message id: [32425368], ⏎
State: [ok], ⏎
Delivered in [23.469068759] seconds
Tip

The "Delivered in" value is particularly useful to monitor when scaling CloudForms as this shows the time that the worker spent processing the task associated with the message.

2.7.4. Monitoring Message Queue Status

The overall performance of any multi-appliance CloudForms installation is largely dependant on the timely processing of messages. Fortunately the internal log_system_status method writes the queue states to evm.log every 5 minutes, and this information can be used to assess message throughput.

To find the numbers of messages currently being processed (in state "dequeue") in each zone, use the following bash command:

grep 'count for state=\["dequeue"\]' evm.log
... Q-task_id([log_status]) MIQ(MiqServer.log_system_status) ⏎
[EVM Server (2768)] MiqQueue count for state=["dequeue"] ⏎
by zone and role: {"RHV"=>{nil=>1, "automate"=>1, ⏎
"ems_metrics_coordinator"=>1, "ems_metrics_collector"=>2, ⏎
"ems_metrics_processor"=>2, "smartproxy"=>1, "smartstate"=>2}, ⏎
nil=>{"database_owner"=>1}}
Tip

Messages that appear to be in state 'dequeue' for longer than their timeout period were probably 'in-flight' when the worker process running them died or was terminated. 

To find the numbers of messages in state "error" in each zone, use the following bash command:

grep 'count for state=\["error"\]' evm.log
... Q-task_id([log_status]) MIQ(MiqServer.log_system_status) ⏎
[EVM Server (2768)] MiqQueue count for state=["error"] ⏎
by zone and role: {"RHV"=>{nil=>36}, "default"=>{nil=>16}, ⏎
"UI Zone"=>{nil=>35}}

To find the numbers of messages in state "ready" that are waiting to be dequeued in each zone, use the following bash command:

grep 'count for state=\["ready"\]' evm.log
... Q-task_id([log_status]) MIQ(MiqServer.log_system_status) ⏎
[EVM Server (2768)] \ MiqQueue count for state=["ready"] ⏎
by zone and role: {"UI Zone"=>{"smartstate"=>15, "smartproxy"=>2, ⏎
nil=>4}, "default"=>{"automate"=>2, nil=>21, "smartstate"=>1, ⏎
"smartproxy"=>1}, "RHV"=>{"automate"=>6, "ems_inventory"=>1, ⏎
nil=>19, "smartstate"=>2, "ems_metrics_processor"=>1259, ⏎
"ems_metrics_collector"=>641}}
Tip

The count for "ready" state elements in the MiqQueue table should not be greater than twice the number of managed objects (e.g. hosts, VMs, storages) in the region. A higher number than this is a good indication that the worker count should be increased, or further CFME appliances deployed to handle the additional workload.

2.8. Summary of Roles, Workers and Messages

The following table summarises the server roles, the workers performing the role tasks, the 'Role' field within the messages handled by those workers, and the maximum number of concurrent instances of the role per region or zone.

RoleWorkerMessage 'Role'Maximum Concurrent Workers

Automation Engine

Generic or Priority

automate

unlimited/ region

C&U Coordinator

Generic or Priority

ems_metrics_coordinator

one/zone

C&U Data Collector

provider-specific MetricsCollectorWorker

ems_metrics_collector

unlimited/ zone

C&U Data Processor

MiqEmsMetricsProcessorWorker

ems_metrics_processor

unlimited/ zone

Database Operations

Generic or Priority

database_owner

unlimited/ region

Embedded Ansible

EmbeddedAnsibleWorker

N/A

one/ region

Event Monitor

MiqEventHandler & provider-specific EventCatchers

event

one/zone & one/ provider/ zone

Git Repositories Owner

N/A

N/A

one/zone

Notifier

Generic or Priority

notifier

one/ region

Provider Inventory

provider-specific RefreshWorker

ems_inventory

one/ provider/ zone

Provider Operations

Generic or Priority

ems_operations

unlimited/ zone

RHN Mirror

N/A

N/A

unlimited/ region

Reporting

MiqReportingWorker

reporting

unlimited/ region

Scheduler

MiqScheduleWorker

N/A

one/ region

SmartProxy

MiqSmartProxyWorker

smartproxy

unlimited/ zone

SmartState Analysis

Generic or Priority

smartstate

unlimited/ zone

User Interface

MiqUiWorker

N/A

unlimited/ region

Web Services

MiqWebServiceWorker

N/A

unlimited/ region

Web Socket

MiqWebsocketWorker

N/A

unlimited/ region



[1] The full list of supported providers and their capabilities is included in the CloudForms Support Matrix document. The most recent Support Matrix document is here: https://access.redhat.com/documentation/en-us/red_hat_cloudforms/4.2/html/support_matrix/
[2] Not all automation tasks are queued. The automate methods that populate dynamic dialog elements, for example, are run immediately on the CFME appliance running the WebUI session, regardless of whether it has the Automation Engine role enabled
[4] Worker processes issue a heartbeat ping every 10 seconds
[5] The time limit for Refresh workers sometimes needs to be increased in very large environments where a full refresh can take longer than 2 hours