Chapter 9. Event Handling
The timely processing of external and internal events is important to the overall smooth running of a CloudForms installation. This section discusses the event handling process and how it can be tuned for scale.
9.1. Event Processing Workflow
The event processing workflow involves 3 different workers, as follows:
-
A provider-specific event catcher polls the EMS event source for new events using an API call such as
https://rhevm/api/events?from=54316
(see Section 9.1.1, “Event Catcher Polling Frequency” for the frequency of this polling). For each new event caught a message is queued for the event handler -
The generic MiqEventHandler worker dequeues the message, and creates an EmsEvent EventStream object. Any EMS-specific references such as
:vm⇒{:id⇒"4e7b66b7-080d-4593-b670-3d6259e47a0f"}
are translated into the equivalent CloudForms object ID such as"VmOrTemplate::vm"⇒1000000000023
, and a new high priority message is queued for automate A Priority worker dequeues the message and processes it through the automate event switchboard using the EventStream object created by the MiqEventHandler. Processing the event may involve several event handler automate instances that perform actions such as:
- Process any control policies associated with the event
- Process any alarms associated with the event
- Initiate any further operations that are required after the event, such as triggering an EMS refresh
The event workflow is illustrated in Figure 9.1, “Event Processing Workflow”
Figure 9.1. Event Processing Workflow

9.1.1. Event Catcher Polling Frequency
The polling frequency of each of the provider-specific event catchers is defined in the :event_catcher
section of the Configuration→Advanced settings. The default settings for CloudForms Management Engine 5.8 are as follows:
:event_catcher: :poll: 1.seconds :event_catcher_ansible_tower: :poll: 20.seconds :event_catcher_embedded_ansible: :poll: 20.seconds :event_catcher_redhat: :poll: 15.seconds :event_catcher_openstack: :poll: 15.seconds :event_catcher_openstack_infra: :poll: 15.seconds :event_catcher_openstack_network: :poll: 15.seconds :event_catcher_hawkular: :poll: 10.seconds :event_catcher_hawkular_datawarehouse: :poll: 1.minute :event_catcher_google: :poll: 15.seconds :event_catcher_kubernetes: :poll: 1.seconds :event_catcher_lenovo: :poll: 4.minutes :event_catcher_openshift: :poll: 1.seconds :event_catcher_cinder: :poll: 10.seconds :event_catcher_swift: :poll: 10.seconds :event_catcher_amazon: :poll: 15.seconds :event_catcher_azure: :poll: 15.seconds :event_catcher_vmware: :poll: 1.seconds :event_catcher_vmware_cloud: :poll: 15.seconds
9.2. Generic Events
Some external management systems implement generic event types that are issued under a variety of conditions. They are often used by third-party software vendors as a means to add their own specific events to those of the native EMS. Generic events often have a sub-type associated with them to indicate a more specific event source.
9.2.1. EventEx
VMware vCenter management systems use an event type called EventEx as a catch-all event. Several VMware components issue EventEx events with a subtype to record state changes, problems, and recovery from problems. They appear as [EventEx]-[subtype]
, for example:
- [EventEx]-[com.vmware.vc.VmDiskConsolidatedEvent]
- [EventEx]-[com.vmware.vim.eam.task.scanForUnknownAgentVmsCompleted]
- [EventEx]-[com.vmware.vim.eam.task.scanForUnknownAgentVmsInitiated]
- [EventEx]-[esx.problem.scsi.device.io.latency.high]
- [EventEx]-[esx.problem.vmfs.heartbeat.recovered]
- [EventEx]-[esx.problem.vmfs.heartbeat.timedout]
- [EventEx]-[vprob.storage.connectivity.lost]
- [EventEx]-[vprob.vmfs.heartbeat.recovered]
- [EventEx]-[vprob.vmfs.heartbeat.timedout]
9.3. Event storms
Event storms are very large bursts of events emitted by a provider’s EMS. They can be caused by several types of warning or failure condition, including storage or adapter problems, or host capacity, swap space usage or other host thresholds being crossed. When a component is failing intermittently the storm is often made worse by events indicating the transition between problem and non-problem state, for example:
[----] I, [2017-01-25T03:23:04.998138 #374:66b14c] ... caught event ⏎ [EventEx]-[esx.clear.scsi.device.io.latency.improved] chainId [427657] [----] I, [2017-01-25T03:23:04.998233 #374:66b14c] ... caught event ⏎ [EventEx]-[esx.problem.scsi.device.io.latency.high] chainId [427658] [----] I, [2017-01-25T03:23:04.998289 #374:66b14c] ... caught event ⏎ [EventEx]-[esx.clear.scsi.device.io.latency.improved] chainId [427659] [----] I, [2017-01-25T03:23:04.998340 #374:66b14c] ... caught event ⏎ [EventEx]-[esx.clear.scsi.device.io.latency.improved] chainId [427660] [----] I, [2017-01-25T03:23:04.998389 #374:66b14c] ... caught event ⏎ [EventEx]-[esx.problem.scsi.device.io.latency.high] chainId [427661] [----] I, [2017-01-25T03:23:04.998435 #374:66b14c] ... caught event ⏎ [EventEx]-[esx.problem.scsi.device.io.latency.high] chainId [427662] [----] I, [2017-01-25T03:23:04.998482 #374:66b14c] ... caught event ⏎ [EventEx]-[esx.clear.scsi.device.io.latency.improved] chainId [427663] [----] I, [2017-01-25T03:23:04.998542 #374:66b14c] ... caught event ⏎ [EventEx]-[esx.clear.scsi.device.io.latency.improved] chainId [427664]
The log snippet above is from a production CloudForms installation. Note that many events are received within the same millisecond - typical of an event storm
Event storms are highly detrimental to the overall performance of a CloudForms region for many reasons, including the following:
- All MiqEventHandler workers in a zone can be overwhelmed processing messages from one provider, to the detriment of other providers in that zone
- The many hundreds of thousands (up to tens of millions) of unprocessed high-priority messages in the miq_queue table consume all Generic and Priority workers in the zone
-
The number of messages in the miq_queue table affects the performance of
get_message_via_drb
for all queue workers in the entire region
In some cases the problems are temporary and clear themselves after the event message emission stops and the CFME appliances can process the messages already queued for processing. In other cases the sheer volume of event messages can result in appliances which still appear to be running, but where the CFME services - including the WebUI - are unresponsive.
9.3.1. Handling and Recovering from Event Storms
Until the cause of the event storm is identified and corrected, the quickest way to restore any operation for the CloudForms environment is to to prevent the continued growth of the miq_queue table. The simplest techniques are to blacklist the event(s) causing the storm (see Section 9.4.1, “Blacklisting Events”), or to disable the event monitor role on all CFME appliance in the provider’s zone.
Disabling the event monitor will disable both the event catcher and event processor workers, so queued messages in the miq_queue table will not be processed. If there are multiple providers in the zone, event catching and handling for these providers may also become inactive.
In critical situations with many hundreds of thousands to millions of queued messages, it may be necessary to selectively delete message instances from the miq_queue table. Since the overwhelming number of messages expected to be in this table will be of type 'event', the following SQL statement can be used to remove all such instances from the miq_queue table:
delete from miq_queue where role = 'event' and class_name = 'EmsEvent';
Before running this query the following points should be noted:
- The only response from this query is a count of the number of messages removed
- The query only deletes messages where the role is 'event' and should not touch any other messages that have been queued
- Even though one single specific event may be responsible for 99+% of the instances, any non-problem event messages will also be deleted.
9.4. Tuning Event Handling
There are several measures that can be taken to tune event handling for scale, including filtering the events that are to be processed or ignored.
9.4.1. Blacklisting Events
Some provider events occur relatively frequently, but are either uninteresting to CloudForms, or processing them would consume excessive resources (such as those typically associated with event storms). Events such as these can be skipped or blacklisted. The event catchers write a list of blacklisted events to evm.log when they start, for example:
... MIQ(ManageIQ::Providers::Redhat::InfraManager::EventCatcher:: ⏎ Runner#after_initialize) EMS [rhevm.bit63.net] as [cfme@internal] ⏎ Event Catcher skipping the following events: ... INFO -- : - UNASSIGNED ... INFO -- : - USER_REMOVE_VG ... INFO -- : - USER_REMOVE_VG_FAILED ... INFO -- : - USER_VDC_LOGIN ... INFO -- : - USER_VDC_LOGIN_FAILED ... INFO -- : - USER_VDC_LOGOUT
These events are defined in the blacklisted_events table in the VMDB. The default rows in the table are as follows:
vmdb_production=# select event_name,provider_model ⏎ from blacklisted_events; event_name | provider_model ----------------------------------------+------------------------------ storageAccounts_listKeys_BeginRequest | ...Azure::CloudManager storageAccounts_listKeys_EndRequest | ...Azure::CloudManager identity.authenticate | ...Openstack::CloudManager scheduler.run_instance.start | ...Openstack::CloudManager scheduler.run_instance.scheduled | ...Openstack::CloudManager scheduler.run_instance.end | ...Openstack::CloudManager ConfigurationSnapshotDeliveryCompleted | ...Amazon::CloudManager ConfigurationSnapshotDeliveryStarted | ...Amazon::CloudManager ConfigurationSnapshotDeliveryFailed | ...Amazon::CloudManager UNASSIGNED | ...Redhat::InfraManager USER_REMOVE_VG | ...Redhat::InfraManager USER_REMOVE_VG_FAILED | ...Redhat::InfraManager USER_VDC_LOGIN | ...Redhat::InfraManager USER_VDC_LOGOUT | ...Redhat::InfraManager USER_VDC_LOGIN_FAILED | ...Redhat::InfraManager AlarmActionTriggeredEvent | ...Vmware::InfraManager AlarmCreatedEvent | ...Vmware::InfraManager AlarmEmailCompletedEvent | ...Vmware::InfraManager AlarmEmailFailedEvent | ...Vmware::InfraManager AlarmReconfiguredEvent | ...Vmware::InfraManager AlarmRemovedEvent | ...Vmware::InfraManager AlarmScriptCompleteEvent | ...Vmware::InfraManager AlarmScriptFailedEvent | ...Vmware::InfraManager AlarmSnmpCompletedEvent | ...Vmware::InfraManager AlarmSnmpFailedEvent | ...Vmware::InfraManager AlarmStatusChangedEvent | ...Vmware::InfraManager AlreadyAuthenticatedSessionEvent | ...Vmware::InfraManager EventEx | ...Vmware::InfraManager UserLoginSessionEvent | ...Vmware::InfraManager UserLogoutSessionEvent | ...Vmware::InfraManager identity.authenticate | ...Openstack::InfraManager scheduler.run_instance.start | ...Openstack::NetworkManager scheduler.run_instance.scheduled | ...Openstack::NetworkManager scheduler.run_instance.end | ...Openstack::NetworkManager ConfigurationSnapshotDeliveryCompleted | ...Amazon::NetworkManager ConfigurationSnapshotDeliveryStarted | ...Amazon::NetworkManager ConfigurationSnapshotDeliveryFailed | ...Amazon::NetworkManager (37 rows)
If processing of any of the events in the blacklisted_events table is required, the enabled field can be set to false and the provider-specific event catcher restarted.
An EMS can also report some minor object property changes as events, even though these are not modelled in the CloudForms VMDB. For VMware providers such event types can be added to the "Vim Broker Exclude List" so that they can be discarded without processing. The exclude list is found under :broker_notify_properties
in the Configuration → Advanced settings, as follows:
:broker_notify_properties: :exclude: :HostSystem: - config.consoleReservation - config.dateTimeInfo - config.network - config.service - summary - summary.overallStatus - summary.runtime.bootTime - summary.runtime.healthSystemRuntime.systemHealthInfo. ⏎ numericSensorInfo :VirtualMachine: - config.locationId - config.memoryAllocation.overheadLimit - config.npivWorldWideNameType - guest.disk - guest.guestFamily - guest.guestFullName - guest.guestId - guest.ipStack - guest.net - guest.screen - guest.screen.height - guest.screen.width - guest.toolsRunningStatus - guest.toolsStatus - resourceConfig - summary - summary.guest.guestFullName - summary.guest.guestId - summary.guest.toolsRunningStatus - summary.overallStatus - summary.runtime.bootTime - summary.runtime.memoryOverhead - summary.runtime.numMksConnections - summary.storage - summary.storage.committed - summary.storage.unshared
9.4.2. Flood Monitoring
CloudForms recently introduced the concept of flood monitoring for the provider-specific event catchers. This stops provider events from being queued when too many duplicates are received in a short time. By default an event is considered as flooding if it is received 30 times in one minute.
Flood monitoring is a generic concept for event processing, but requires the appropriate supporting methods to be added to each provider. As of CloudForms Management Engine 5.8 only the VMware provider supports this functionality.
9.4.3. Event Catcher Configuration
The :event_catcher
section is one of the largest of the Configuration → Advanced settings, and it defines the configuration of each type of event catcher. For example the following extract shows the settings for the ManageIQ::Providers::Openstack::InfraManager::EventCatcher worker:
:event_catcher: ... :event_catcher_openstack: :poll: 15.seconds :topics: :nova: notifications.* :cinder: notifications.* :glance: notifications.* :heat: notifications.* :duration: 10.seconds :capacity: 50 :amqp_port: 5672 :amqp_heartbeat: 30 :amqp_recovery_attempts: 4 :ceilometer: :event_types_regex: "\\A(?!firewall|floatingip|gateway| ⏎ net|port|router|subnet|security_group|vpn)" ...
The configuration settings rarely need to be changed from their defaults.
9.5. Scaling Out
The event processing workflow can be quite resource-intensive. CloudForms installations managing several thousand objects may benefit from dedicated CFME appliances exclusively running the provider-specific EventCatcher workers and MiqEventHandler worker in any zone containing providers.