Chapter 9. Event Handling

The timely processing of external and internal events is important to the overall smooth running of a CloudForms installation. This section discusses the event handling process and how it can be tuned for scale.

9.1. Event Processing Workflow

The event processing workflow involves three different workers, as follows:

  1. A provider-specific event catcher polls the EMS event source for new events using an API call such as https://rhevm/api/events?from=54316 (see Section 9.1.1, “Event Catcher Polling Frequency” for the frequency of this polling). For each new event caught, a message is queued for the event handler.
  2. The generic MiqEventHandler worker dequeues the message and creates an EmsEvent EventStream object. Any EMS-specific references such as :vm=>{:id=>"4e7b66b7-080d-4593-b670-3d6259e47a0f"} are translated into the equivalent CloudForms object ID such as "VmOrTemplate::vm"=>1000000000023, and a new high priority message is queued for automate.
  3. A Priority worker dequeues the message and processes it through the automate event switchboard using the EventStream object created by the MiqEventHandler. Processing the event may involve several event handler automate instances that perform actions such as:

    • Process any control policies associated with the event
    • Process any alarms associated with the event
    • Initiate any further operations that are required after the event, such as triggering an EMS refresh
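
The three stages above can be sketched in a few lines of Python. This is a minimal illustration only, not the CloudForms implementation: the queue objects, the function names, the lookup table and the USER_RUN_VM event used as input are all assumptions made for the example, and the automate stage is reduced to a stub.

```python
from collections import deque

# Hypothetical lookup standing in for the VMDB: EMS reference -> CloudForms object ID.
EMS_REF_TO_ID = {"4e7b66b7-080d-4593-b670-3d6259e47a0f": 1000000000023}

handler_queue = deque()   # messages queued by the event catcher
automate_queue = deque()  # high priority messages queued for automate


def catch_events(raw_events):
    """Stage 1: the event catcher queues one message per new event."""
    for event in raw_events:
        handler_queue.append(event)


def handle_events():
    """Stage 2: the event handler translates EMS references into object IDs."""
    while handler_queue:
        event = handler_queue.popleft()
        event_stream = {
            "event_type": event["event_type"],
            "VmOrTemplate::vm": EMS_REF_TO_ID.get(event["vm"]["id"]),
        }
        automate_queue.append({"priority": "high", "event_stream": event_stream})


def process_automate():
    """Stage 3: a Priority worker hands each event to automate (stubbed here)."""
    processed = []
    while automate_queue:
        processed.append(automate_queue.popleft()["event_stream"])
    return processed


# Illustrative input event; the event type is an assumption for the example.
catch_events([{"event_type": "USER_RUN_VM",
               "vm": {"id": "4e7b66b7-080d-4593-b670-3d6259e47a0f"}}])
handle_events()
result = process_automate()
print(result)
```

The point of the sketch is the decoupling: each stage only reads from one queue and writes to the next, which is why a backlog at any stage shows up as queue growth.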

The event workflow is illustrated in Figure 9.1, “Event Processing Workflow”.

Figure 9.1. Event Processing Workflow


9.1.1. Event Catcher Polling Frequency

The polling frequency of each of the provider-specific event catchers is defined in the :event_catcher section of the Configuration→Advanced settings. The default settings for CloudForms Management Engine 5.8 are as follows:

    :event_catcher:
      :poll: 1.seconds
      :event_catcher_ansible_tower:
        :poll: 20.seconds
      :event_catcher_embedded_ansible:
        :poll: 20.seconds
      :event_catcher_redhat:
        :poll: 15.seconds
      :event_catcher_openstack:
        :poll: 15.seconds
      :event_catcher_openstack_infra:
        :poll: 15.seconds
      :event_catcher_openstack_network:
        :poll: 15.seconds
      :event_catcher_hawkular:
        :poll: 10.seconds
      :event_catcher_hawkular_datawarehouse:
        :poll: 1.minute
      :event_catcher_google:
        :poll: 15.seconds
      :event_catcher_kubernetes:
        :poll: 1.seconds
      :event_catcher_lenovo:
        :poll: 4.minutes
      :event_catcher_openshift:
        :poll: 1.seconds
      :event_catcher_cinder:
        :poll: 10.seconds
      :event_catcher_swift:
        :poll: 10.seconds
      :event_catcher_amazon:
        :poll: 15.seconds
      :event_catcher_azure:
        :poll: 15.seconds
      :event_catcher_vmware:
        :poll: 1.seconds
      :event_catcher_vmware_cloud:
        :poll: 15.seconds
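
When comparing polling intervals across providers it can help to convert the duration strings into seconds. The following is a minimal sketch, assuming the values always take the form of a number followed by a unit, as in the listing above; the poll_seconds helper is hypothetical and is not part of CloudForms.

```python
import re

# Seconds per unit for the duration strings used in the settings listing.
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}


def poll_seconds(value):
    """Convert a setting such as '15.seconds' or '1.minute' to seconds."""
    match = re.fullmatch(r"(\d+)\.(second|minute|hour)s?", value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


# A few of the defaults listed above.
polls = {
    "event_catcher_vmware": "1.seconds",
    "event_catcher_redhat": "15.seconds",
    "event_catcher_hawkular_datawarehouse": "1.minute",
    "event_catcher_lenovo": "4.minutes",
}
for name, value in sorted(polls.items(), key=lambda item: poll_seconds(item[1])):
    print(f"{name}: {poll_seconds(value)}s")
```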

9.2. Generic Events

Some external management systems implement generic event types that are issued under a variety of conditions. They are often used by third-party software vendors as a means to add their own specific events to those of the native EMS. Generic events often have a sub-type associated with them to indicate a more specific event source.

9.2.1. EventEx

VMware vCenter management systems use an event type called EventEx as a catch-all event. Several VMware components issue EventEx events with a subtype to record state changes, problems, and recovery from problems. They appear as [EventEx]-[subtype], for example: 

  • [EventEx]-[com.vmware.vc.VmDiskConsolidatedEvent]
  • [EventEx]-[com.vmware.vim.eam.task.scanForUnknownAgentVmsCompleted]
  • [EventEx]-[com.vmware.vim.eam.task.scanForUnknownAgentVmsInitiated]
  • [EventEx]-[esx.problem.scsi.device.io.latency.high]
  • [EventEx]-[esx.problem.vmfs.heartbeat.recovered]
  • [EventEx]-[esx.problem.vmfs.heartbeat.timedout]
  • [EventEx]-[vprob.storage.connectivity.lost]
  • [EventEx]-[vprob.vmfs.heartbeat.recovered]
  • [EventEx]-[vprob.vmfs.heartbeat.timedout]
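
The subtype can be separated from the generic event name with a simple pattern match. The following is a minimal sketch, assuming event names are formatted exactly as in the list above; the eventex_subtype helper is hypothetical.

```python
import re

# Matches names of the form [EventEx]-[subtype].
EVENT_EX = re.compile(r"\[EventEx\]-\[(?P<subtype>[^\]]+)\]")


def eventex_subtype(name):
    """Return the subtype of an EventEx event name, or None for other events."""
    match = EVENT_EX.fullmatch(name)
    return match.group("subtype") if match else None


names = [
    "[EventEx]-[com.vmware.vc.VmDiskConsolidatedEvent]",
    "[EventEx]-[esx.problem.vmfs.heartbeat.timedout]",
    "VmPoweredOnEvent",  # a non-EventEx event, included for contrast
]
for name in names:
    print(name, "->", eventex_subtype(name))
```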

9.3. Event Storms

Event storms are very large bursts of events emitted by a provider’s EMS. They can be caused by several types of warning or failure condition, including storage or adapter problems, or host capacity, swap space usage or other host thresholds being crossed. When a component is failing intermittently, the storm is often made worse by events indicating each transition between the problem and non-problem states, for example:

[----] I, [2017-01-25T03:23:04.998138 #374:66b14c]  ... caught event ⏎
[EventEx]-[esx.clear.scsi.device.io.latency.improved] chainId [427657]
[----] I, [2017-01-25T03:23:04.998233 #374:66b14c]  ... caught event ⏎
[EventEx]-[esx.problem.scsi.device.io.latency.high] chainId [427658]
[----] I, [2017-01-25T03:23:04.998289 #374:66b14c]  ... caught event ⏎
[EventEx]-[esx.clear.scsi.device.io.latency.improved] chainId [427659]
[----] I, [2017-01-25T03:23:04.998340 #374:66b14c]  ... caught event ⏎
[EventEx]-[esx.clear.scsi.device.io.latency.improved] chainId [427660]
[----] I, [2017-01-25T03:23:04.998389 #374:66b14c]  ... caught event ⏎
[EventEx]-[esx.problem.scsi.device.io.latency.high] chainId [427661]
[----] I, [2017-01-25T03:23:04.998435 #374:66b14c]  ... caught event ⏎
[EventEx]-[esx.problem.scsi.device.io.latency.high] chainId [427662]
[----] I, [2017-01-25T03:23:04.998482 #374:66b14c]  ... caught event ⏎
[EventEx]-[esx.clear.scsi.device.io.latency.improved] chainId [427663]
[----] I, [2017-01-25T03:23:04.998542 #374:66b14c]  ... caught event ⏎
[EventEx]-[esx.clear.scsi.device.io.latency.improved] chainId [427664]

Note

The log snippet above is from a production CloudForms installation. Many events are received within the same millisecond, which is typical of an event storm.
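
A storm of this kind can be spotted by counting "caught event" lines per event type and per second. The following is a minimal sketch, assuming each log entry is a single unwrapped line in the format shown above; the regular expression and the events_per_second helper are assumptions for illustration, and the elided portion of each line is kept as "...".

```python
import re
from collections import Counter

# Captures the timestamp (to the second) and the event name of a "caught event" line.
LINE = re.compile(
    r"\[(?P<second>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+ #"
    r".*caught event (?P<event>\S+) chainId"
)


def events_per_second(lines):
    """Count caught events, keyed by (second, event name)."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if match:
            counts[(match.group("second"), match.group("event"))] += 1
    return counts


sample = [
    "[----] I, [2017-01-25T03:23:04.998138 #374:66b14c]  ... caught event "
    "[EventEx]-[esx.clear.scsi.device.io.latency.improved] chainId [427657]",
    "[----] I, [2017-01-25T03:23:04.998233 #374:66b14c]  ... caught event "
    "[EventEx]-[esx.problem.scsi.device.io.latency.high] chainId [427658]",
    "[----] I, [2017-01-25T03:23:04.998289 #374:66b14c]  ... caught event "
    "[EventEx]-[esx.clear.scsi.device.io.latency.improved] chainId [427659]",
]
counts = events_per_second(sample)
for (second, event), count in counts.most_common():
    print(second, event, count)
```

A sustained count of many identical events per second for one provider is a strong hint that the event is a candidate for blacklisting.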

Event storms are highly detrimental to the overall performance of a CloudForms region for many reasons, including the following:

  • All MiqEventHandler workers in a zone can be overwhelmed processing messages from one provider, to the detriment of other providers in that zone
  • The many hundreds of thousands (up to tens of millions) of unprocessed high-priority messages in the miq_queue table consume all Generic and Priority workers in the zone
  • The number of messages in the miq_queue table affects the performance of get_message_via_drb for all queue workers in the entire region

In some cases the problems are temporary and clear themselves after the event message emission stops and the CFME appliances can process the messages already queued for processing. In other cases the sheer volume of event messages can result in appliances which still appear to be running, but where the CFME services - including the WebUI - are unresponsive.

9.3.1. Handling and Recovering from Event Storms

Until the cause of the event storm is identified and corrected, the quickest way to restore any operation for the CloudForms environment is to prevent the continued growth of the miq_queue table. The simplest techniques are to blacklist the event(s) causing the storm (see Section 9.4.1, “Blacklisting Events”), or to disable the event monitor role on all CFME appliances in the provider’s zone.

Note

Disabling the event monitor will disable both the event catcher and event processor workers, so queued messages in the miq_queue table will not be processed. If there are multiple providers in the zone, event catching and handling for these providers may also become inactive.

In critical situations with many hundreds of thousands to millions of queued messages, it may be necessary to selectively delete message instances from the miq_queue table. Since the overwhelming majority of messages in this table are expected to have the role 'event', the following SQL statement can be used to remove all such instances from the miq_queue table:

delete from miq_queue where role = 'event' and class_name = 'EmsEvent';

Before running this query the following points should be noted:

  • The only response from this query is a count of the number of messages removed
  • The query only deletes messages where the role is 'event' and the class name is 'EmsEvent'; it should not touch any other messages that have been queued
  • Even though one single specific event may be responsible for more than 99% of the instances, any non-problem event messages will also be deleted
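
Because the statement is destructive, it can be useful to count the matching rows with the same WHERE clause before deleting them. The following sketch demonstrates the idea against an in-memory SQLite table; this is an assumption made purely so that the example is self-contained, as the VMDB itself is a PostgreSQL database. The table here is a heavily simplified stand-in for miq_queue, and the non-event rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical subset of the miq_queue columns.
conn.execute(
    "CREATE TABLE miq_queue (id INTEGER PRIMARY KEY, role TEXT, class_name TEXT)"
)
conn.executemany(
    "INSERT INTO miq_queue (role, class_name) VALUES (?, ?)",
    [
        ("event", "EmsEvent"),
        ("event", "EmsEvent"),
        ("event", "EmsEvent"),
        ("ems_inventory", "EmsRefresh"),   # illustrative non-event message
        ("automate", "MiqAeEngine"),       # illustrative non-event message
    ],
)

WHERE = "role = 'event' AND class_name = 'EmsEvent'"

# Dry run: count what the delete would remove.
to_delete = conn.execute(f"SELECT COUNT(*) FROM miq_queue WHERE {WHERE}").fetchone()[0]

# The delete itself reports only the number of rows removed.
deleted = conn.execute(f"DELETE FROM miq_queue WHERE {WHERE}").rowcount
remaining = conn.execute("SELECT COUNT(*) FROM miq_queue").fetchone()[0]
print(to_delete, deleted, remaining)
```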

9.4. Tuning Event Handling

There are several measures that can be taken to tune event handling for scale, including filtering the events that are to be processed or ignored.

9.4.1. Blacklisting Events

Some provider events occur relatively frequently, but are either uninteresting to CloudForms, or processing them would consume excessive resources (such as those typically associated with event storms). Events such as these can be skipped or blacklisted. The event catchers write a list of blacklisted events to evm.log when they start, for example:

... MIQ(ManageIQ::Providers::Redhat::InfraManager::EventCatcher:: ⏎
Runner#after_initialize) EMS [rhevm.bit63.net] as [cfme@internal] ⏎
Event Catcher skipping the following events:
... INFO -- :   - UNASSIGNED
... INFO -- :   - USER_REMOVE_VG
... INFO -- :   - USER_REMOVE_VG_FAILED
... INFO -- :   - USER_VDC_LOGIN
... INFO -- :   - USER_VDC_LOGIN_FAILED
... INFO -- :   - USER_VDC_LOGOUT

These events are defined in the blacklisted_events table in the VMDB. The default rows in the table are as follows:

vmdb_production=# select event_name,provider_model ⏎
from blacklisted_events;
               event_name               |    provider_model
----------------------------------------+------------------------------
 storageAccounts_listKeys_BeginRequest  | ...Azure::CloudManager
 storageAccounts_listKeys_EndRequest    | ...Azure::CloudManager
 identity.authenticate                  | ...Openstack::CloudManager
 scheduler.run_instance.start           | ...Openstack::CloudManager
 scheduler.run_instance.scheduled       | ...Openstack::CloudManager
 scheduler.run_instance.end             | ...Openstack::CloudManager
 ConfigurationSnapshotDeliveryCompleted | ...Amazon::CloudManager
 ConfigurationSnapshotDeliveryStarted   | ...Amazon::CloudManager
 ConfigurationSnapshotDeliveryFailed    | ...Amazon::CloudManager
 UNASSIGNED                             | ...Redhat::InfraManager
 USER_REMOVE_VG                         | ...Redhat::InfraManager
 USER_REMOVE_VG_FAILED                  | ...Redhat::InfraManager
 USER_VDC_LOGIN                         | ...Redhat::InfraManager
 USER_VDC_LOGOUT                        | ...Redhat::InfraManager
 USER_VDC_LOGIN_FAILED                  | ...Redhat::InfraManager
 AlarmActionTriggeredEvent              | ...Vmware::InfraManager
 AlarmCreatedEvent                      | ...Vmware::InfraManager
 AlarmEmailCompletedEvent               | ...Vmware::InfraManager
 AlarmEmailFailedEvent                  | ...Vmware::InfraManager
 AlarmReconfiguredEvent                 | ...Vmware::InfraManager
 AlarmRemovedEvent                      | ...Vmware::InfraManager
 AlarmScriptCompleteEvent               | ...Vmware::InfraManager
 AlarmScriptFailedEvent                 | ...Vmware::InfraManager
 AlarmSnmpCompletedEvent                | ...Vmware::InfraManager
 AlarmSnmpFailedEvent                   | ...Vmware::InfraManager
 AlarmStatusChangedEvent                | ...Vmware::InfraManager
 AlreadyAuthenticatedSessionEvent       | ...Vmware::InfraManager
 EventEx                                | ...Vmware::InfraManager
 UserLoginSessionEvent                  | ...Vmware::InfraManager
 UserLogoutSessionEvent                 | ...Vmware::InfraManager
 identity.authenticate                  | ...Openstack::InfraManager
 scheduler.run_instance.start           | ...Openstack::NetworkManager
 scheduler.run_instance.scheduled       | ...Openstack::NetworkManager
 scheduler.run_instance.end             | ...Openstack::NetworkManager
 ConfigurationSnapshotDeliveryCompleted | ...Amazon::NetworkManager
 ConfigurationSnapshotDeliveryStarted   | ...Amazon::NetworkManager
 ConfigurationSnapshotDeliveryFailed    | ...Amazon::NetworkManager
(37 rows)

If processing of any of the events in the blacklisted_events table is required, the enabled field can be set to false and the provider-specific event catcher restarted.
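
The effect of the enabled field can be illustrated with a short sketch. This is not the event catcher code: the rows, the skipped_events and should_queue helpers, and the enabled values are assumptions for the example, and the provider model names are left truncated exactly as they appear in the table above.

```python
# Rows modelled on the blacklisted_events table; "enabled" values are illustrative.
rows = [
    {"event_name": "USER_VDC_LOGIN", "provider_model": "...Redhat::InfraManager", "enabled": True},
    {"event_name": "USER_VDC_LOGOUT", "provider_model": "...Redhat::InfraManager", "enabled": False},
    {"event_name": "EventEx", "provider_model": "...Vmware::InfraManager", "enabled": True},
]


def skipped_events(rows, provider_model):
    """Return the event names a catcher for this provider model would skip."""
    return {
        row["event_name"]
        for row in rows
        if row["provider_model"] == provider_model and row["enabled"]
    }


def should_queue(event_name, skipped):
    """An event is queued for the event handler only if it is not skipped."""
    return event_name not in skipped


redhat_skipped = skipped_events(rows, "...Redhat::InfraManager")
print(sorted(redhat_skipped))
print(should_queue("USER_VDC_LOGOUT", redhat_skipped))
```

In this sketch USER_VDC_LOGOUT is processed again because its row has enabled set to false, while USER_VDC_LOGIN continues to be skipped.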

An EMS can also report some minor object property changes as events, even though these are not modelled in the CloudForms VMDB. For VMware providers such event types can be added to the "Vim Broker Exclude List" so that they can be discarded without processing. The exclude list is found under :broker_notify_properties in the Configuration → Advanced settings, as follows:

:broker_notify_properties:
  :exclude:
    :HostSystem:
    - config.consoleReservation
    - config.dateTimeInfo
    - config.network
    - config.service
    - summary
    - summary.overallStatus
    - summary.runtime.bootTime
    - summary.runtime.healthSystemRuntime.systemHealthInfo. ⏎
         numericSensorInfo
    :VirtualMachine:
    - config.locationId
    - config.memoryAllocation.overheadLimit
    - config.npivWorldWideNameType
    - guest.disk
    - guest.guestFamily
    - guest.guestFullName
    - guest.guestId
    - guest.ipStack
    - guest.net
    - guest.screen
    - guest.screen.height
    - guest.screen.width
    - guest.toolsRunningStatus
    - guest.toolsStatus
    - resourceConfig
    - summary
    - summary.guest.guestFullName
    - summary.guest.guestId
    - summary.guest.toolsRunningStatus
    - summary.overallStatus
    - summary.runtime.bootTime
    - summary.runtime.memoryOverhead
    - summary.runtime.numMksConnections
    - summary.storage
    - summary.storage.committed
    - summary.storage.unshared

9.4.2. Flood Monitoring

CloudForms recently introduced the concept of flood monitoring for the provider-specific event catchers. This stops provider events from being queued when too many duplicates are received in a short time. By default an event is considered to be flooding if it is received 30 times in one minute.

Flood monitoring is a generic concept for event processing, but requires the appropriate supporting methods to be added to each provider. As of CloudForms Management Engine 5.8 only the VMware provider supports this functionality.
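
The default threshold of 30 duplicates in one minute can be pictured as a sliding-window counter. The following is a minimal sketch of that general technique, not the CloudForms implementation; the FloodMonitor class, its parameters and the choice of event key are all assumptions for illustration.

```python
from collections import defaultdict, deque


class FloodMonitor:
    """Sliding-window duplicate counter (hypothetical, for illustration)."""

    def __init__(self, threshold=30, window=60.0):
        self.threshold = threshold  # duplicates that constitute a flood
        self.window = window        # window length in seconds
        self._seen = defaultdict(deque)

    def is_flooding(self, event_key, now):
        """Record one occurrence at time `now` and report whether it floods."""
        times = self._seen[event_key]
        times.append(now)
        # Discard occurrences that have fallen out of the window.
        while times and now - times[0] >= self.window:
            times.popleft()
        return len(times) >= self.threshold


monitor = FloodMonitor()
key = "[EventEx]-[esx.problem.scsi.device.io.latency.high]"
# One duplicate per second for 30 seconds.
flags = [monitor.is_flooding(key, float(t)) for t in range(30)]
print(flags.count(True))
```

A catcher using such a monitor would stop queueing an event while is_flooding returns true, and resume once the duplicates age out of the window.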

9.4.3. Event Catcher Configuration

The :event_catcher section is one of the largest of the Configuration → Advanced settings, and it defines the configuration of each type of event catcher. For example, the following extract shows the settings for the ManageIQ::Providers::Openstack::CloudManager::EventCatcher worker:

    :event_catcher:
...
      :event_catcher_openstack:
        :poll: 15.seconds
        :topics:
          :nova: notifications.*
          :cinder: notifications.*
          :glance: notifications.*
          :heat: notifications.*
        :duration: 10.seconds
        :capacity: 50
        :amqp_port: 5672
        :amqp_heartbeat: 30
        :amqp_recovery_attempts: 4
        :ceilometer:
          :event_types_regex: "\\A(?!firewall|floatingip|gateway| ⏎
          net|port|router|subnet|security_group|vpn)"
...

The configuration settings rarely need to be changed from their defaults.
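
The :event_types_regex value uses a negative lookahead anchored at the start of the string, so it matches any event type that does not begin with one of the listed prefixes. The following sketch shows the effect of the pattern in Python. It assumes that the wrapped line in the extract above joins without any whitespace, and that the pattern is applied to the event type string; the sample event types and the is_accepted helper are illustrative.

```python
import re

# The pattern from the extract, with the wrapped line joined (an assumption).
EVENT_TYPES_REGEX = re.compile(
    r"\A(?!firewall|floatingip|gateway|net|port|router|subnet|security_group|vpn)"
)


def is_accepted(event_type):
    """True if the event type does not start with an excluded prefix."""
    return EVENT_TYPES_REGEX.search(event_type) is not None


for event_type in (
    "compute.instance.create.end",
    "volume.create.end",
    "port.create.end",
    "network.create.end",
):
    print(event_type, is_accepted(event_type))
```

Note that the "net" prefix also excludes event types beginning with "network".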

9.5. Scaling Out

The event processing workflow can be quite resource-intensive. CloudForms installations managing several thousand objects may benefit from dedicated CFME appliances exclusively running the provider-specific EventCatcher workers and MiqEventHandler worker in any zone containing providers.