Disaster Recovery for Ansible Automation Platform on Azure


Overview

The Ansible Automation Platform on Azure can recover from a service-impacting event in an Azure region. This optional feature is enabled in the “Business Continuity” step during installation of the managed application. Enabling this option incurs additional Azure infrastructure costs. Existing customers can ask for the feature to be enabled on their instance by opening a support request.

The disaster recovery feature activates replication of storage between a primary region and its assigned paired region. The paired data centers are geographically distant from each other so that a single natural event is unlikely to affect both. A list of Azure data center pairs can be found here: Azure Cross-Region Replication.
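
As a small illustration of looking up a paired region before deploying, the sketch below hard-codes a handful of well-known Azure pairs. The `REGION_PAIRS` table and the `paired_region` helper are hypothetical names introduced here, and the table is deliberately incomplete; the Azure Cross-Region Replication list remains the authoritative source.

```python
# Illustrative subset of Azure region pairs (NOT a complete list).
# Always confirm against the official Azure Cross-Region Replication list.
REGION_PAIRS = {
    "eastus": "westus",
    "eastus2": "centralus",
    "northeurope": "westeurope",
    "uksouth": "ukwest",
    "southeastasia": "eastasia",
}
# The pairs listed above are symmetric, so record the reverse direction too.
REGION_PAIRS.update({v: k for k, v in list(REGION_PAIRS.items())})


def paired_region(primary: str) -> str:
    """Return the paired region for `primary`, or raise if it is not in the table."""
    key = primary.lower().replace(" ", "")
    try:
        return REGION_PAIRS[key]
    except KeyError:
        raise LookupError(
            f"No pair recorded for {primary!r}; check the Azure list"
        ) from None


print(paired_region("East US 2"))  # centralus
```

A region missing from the table raises `LookupError` rather than guessing, which is the safer behavior when the result decides where a recovery instance is deployed.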

Disaster recovery is not synonymous with high availability: a loss of service and data can occur when the primary region is impacted.

How does disaster recovery work?

A nightly backup of the managed application is placed on Azure storage for replication. This backup will be loaded into a new deployment of the Ansible Automation Platform in a non-impacted region. The amount of time required to recover an instance depends on the amount of data being recovered and the availability of Azure resources.
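
Because the backup is taken nightly, changes made after the most recent backup are not part of the replicated copy. The sketch below estimates that window for a given event; the function name and the timestamps are hypothetical, and the actual backup schedule of an instance is an assumption that should be confirmed with Red Hat support.

```python
from datetime import datetime, timedelta, timezone


def data_at_risk_window(last_backup: datetime, event_time: datetime) -> timedelta:
    """Span of changes made after the last backup, and so not contained in it."""
    if event_time < last_backup:
        raise ValueError("event_time must not precede last_backup")
    return event_time - last_backup


# Hypothetical timestamps, for illustration only.
last_backup = datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc)
event_time = datetime(2024, 5, 1, 17, 30, tzinfo=timezone.utc)
print(data_at_risk_window(last_backup, event_time))  # 15:30:00
```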

How does my application recover from an event?

The following steps should be taken if your managed application's region is experiencing a service-impacting event:

  1. Deploy a new instance of the managed application to a region of your choice. We recommend using the region paired with your primary region. You must deploy the second instance of the managed application using the same Azure subscription as your primary instance.
    Note: To ensure smooth data migration, do not set up any network configurations (such as VNet peering) until the data migration is successfully completed and verified. Once the SRE team confirms a successful recovery, you may proceed with network setup.
  2. Contact Red Hat customer support indicating your managed application's region has failed and your managed application needs to be recovered. Provide the following information:
    • Name of the instance impacted
    • Name of the new instance
    • Azure Subscription ID
    • Contact information for rapid collaboration
  3. Red Hat Site Reliability Engineers will prioritize the recovery operation. The time required for a full recovery depends on the availability of Azure resources and the amount of data to recover.
  4. A Red Hat representative will contact you using the information supplied in your support request to indicate the process is complete. Any issues with the new instance will be prioritized and addressed promptly.
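
The details requested in step 2 can be gathered ahead of time so the support case is complete on first submission. The following is a minimal sketch, not a Red Hat tool: the `RecoveryRequest` class, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryRequest:
    """The four items the support case should include (hypothetical field names)."""
    impacted_instance: str
    new_instance: str
    subscription_id: str
    contact: str

    def to_case_text(self) -> str:
        # Refuse to build the case text if any required item is blank.
        missing = [name for name, value in vars(self).items() if not value.strip()]
        if missing:
            raise ValueError(f"Missing required fields: {', '.join(missing)}")
        return "\n".join([
            "Managed application region has failed; recovery requested.",
            f"Name of the instance impacted: {self.impacted_instance}",
            f"Name of the new instance: {self.new_instance}",
            f"Azure Subscription ID: {self.subscription_id}",
            f"Contact information: {self.contact}",
        ])


# Placeholder values, for illustration only.
request = RecoveryRequest(
    impacted_instance="aap-prod",
    new_instance="aap-recovery",
    subscription_id="00000000-0000-0000-0000-000000000000",
    contact="ops@example.com",
)
print(request.to_case_text())
```

Validating before submission avoids a round trip with support during an outage, when response time matters most.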

The following estimates can help set expectations if a disaster recovery event occurs.

Task Description | Who? | Estimated Time
Contact Red Hat customer support and raise a Sev 1 case if the event is happening during a product outage of the original region. | Customer | See Premium Support SLAs
*Deploy a new instance of the managed application. | Customer | ~1.5 hours
Red Hat Site Reliability Engineers will initiate the recovery operation. | SRE | ~2 hours
**A Red Hat representative will contact you using the information supplied in your support request to indicate the process is complete. | Red Hat Support | When the recovery operation is complete within premium support SLA

*Customers will need to perform “post deployment network configuration” steps on the new environment, such as VNet peering and routing rule definitions. Expect this to take about as long as it did during the initial implementation of Ansible on Azure. Refer to this link for information about customer responsibilities for Ansible Automation Platform on Microsoft Azure.

**DR estimates differ with the data volume and the network/traffic configuration of each customer's environment, and depend on the following variables:

  • The database size of the site being recovered (including job history, inventory, and other AAP data).
  • The number of collections stored in the Private Automation Hub.
  • The number of execution environments (EEs) stored in Private Automation Hub.
  • Whether network routing between regions must be reconfigured, which depends on how traffic is redirected.
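
For rough planning, the two fixed estimates in the table can be added to the time the post-deployment network configuration is expected to take. This is only a sketch under the assumption that the steps run one after another; it excludes support response time, and the variables listed above can move the real figure in either direction. The function and constant names are hypothetical.

```python
DEPLOY_HOURS = 1.5    # customer deploys the new instance (table estimate)
RECOVERY_HOURS = 2.0  # SRE recovery operation (table estimate)


def estimated_hours(network_config_hours: float = 0.0) -> float:
    """Sum the table estimates plus the customer's own network setup time."""
    if network_config_hours < 0:
        raise ValueError("network_config_hours must not be negative")
    return DEPLOY_HOURS + RECOVERY_HOURS + network_config_hours


print(estimated_hours())     # 3.5
print(estimated_hours(4.0))  # 7.5
```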

How can disaster recovery be tested?

Red Hat encourages customers to periodically test disaster recovery procedures. This process can be scheduled by submitting a support request asking for a disaster recovery test, with a limit of one disaster recovery test every six months.
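
To plan around the six-month limit, the earliest date for the next test can be derived from the date of the last one. The sketch below assumes the limit is counted in calendar months from the previous test, which is an interpretation rather than a documented rule; the helper names are hypothetical, and Red Hat support decides the actual schedule.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift `d` by `months`, clamping the day to the length of the target month."""
    index = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_test_allowed(last_test: date) -> date:
    """Earliest date for another DR test, assuming a six-calendar-month gap."""
    return add_months(last_test, 6)


print(next_test_allowed(date(2024, 8, 31)))  # 2025-02-28
```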
