Disaster Recovery for Ansible Automation Platform on Azure

Ansible Automation Platform on Microsoft Azure provides optional additional backup capabilities through a multi-regional model for supported Azure regions. This optional feature is enabled on the “Business Continuity” step during the deployment of the managed application. Enabling this option backs up Ansible Automation Platform (AAP) data from a primary Azure region to a secondary Azure region and incurs additional Azure infrastructure costs for storage in the secondary region. Customers who did not select this option during deployment can request that it be enabled on their instance by submitting a support request.

The disaster recovery feature activates the replication of storage between a primary region and its assigned paired region. AAP on Azure uses Microsoft-defined regional pairs to implement this solution. A list of Azure data center pairs can be found here: Azure Cross-Region Replication (https://learn.microsoft.com/en-us/azure/reliability/cross-region-replication-azure).

Note that disaster recovery is not synonymous with high availability. A loss of service and data can still occur when the primary region is impacted.

What is the regional support for disaster recovery?

The disaster recovery capability is not supported in all Azure regions. Before deployment, customers should consult the regional support matrix below to verify whether their desired region is supported. While nightly backups in the primary region are standard for all instances, replicating to a secondary region further reduces risk if a catastrophic event occurs in the primary Azure region. We are continuously working with Microsoft to expand this capability as they add support for more regions.

Main Region | Multi-Region Disaster Recovery (Y/N) | Paired Backup Region
Australia East | Y | Australia Southeast
Australia Southeast | Y | Australia East
Brazil South | Y | South Central US
Canada Central | Y | Canada East
Canada East | Y | Canada Central
Central India | Y | South India
Central US | Y | East US 2
Chile Central | N | n/a
East Asia | Y | Southeast Asia
East US | Y | West US
East US 2 | Y | Central US
France Central | Y | France South
Germany West Central | Y | Germany North
Indonesia Central | N | n/a
Israel Central | N | n/a
Italy North | N | n/a
Japan East | Y | Japan West
Japan West | Y | Japan East
Korea South | Y | Korea Central
Korea Central | Y | Korea South
Malaysia West | N | n/a
Mexico Central | N | n/a
New Zealand North | N | n/a
North Central US | Y | South Central US
North Europe | Y | West Europe
Norway East | Y | Norway West
Poland Central | N | n/a
Qatar Central | N | n/a
South Africa North | Y | South Africa West
South Central US | Y | North Central US
South India | Y | Central India
Southeast Asia | Y | East Asia
Spain Central | N | n/a
Sweden Central | Y | Sweden South
Switzerland North | Y | Switzerland West
UAE North | Y | UAE Central
UK South | Y | UK West
UK West | Y | UK South
West Central US | Y | West US 2
West Europe | Y | North Europe
West US | Y | East US
West US 2 | Y | West Central US
West US 3 | Y | East US
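
For pre-deployment checks, the support matrix above can be encoded as a simple lookup. The following is a minimal sketch and not part of the product: `REGION_PAIRS` and `paired_backup_region` are hypothetical names chosen for illustration, and only a few rows of the matrix are copied in for brevity.

```python
from typing import Optional

# Hypothetical excerpt of the support matrix above: main region -> paired
# backup region, or None where multi-region disaster recovery is not supported.
REGION_PAIRS: dict[str, Optional[str]] = {
    "Australia East": "Australia Southeast",
    "Brazil South": "South Central US",
    "Chile Central": None,
    "East US": "West US",
    "Italy North": None,
    "North Europe": "West Europe",
    "West Europe": "North Europe",
    "West US 3": "East US",
}


def paired_backup_region(main_region: str) -> Optional[str]:
    """Return the paired backup region, or None if DR is not supported.

    Raises KeyError for regions missing from this excerpt, so an unknown
    region is never mistaken for an unsupported one.
    """
    return REGION_PAIRS[main_region]


if __name__ == "__main__":
    for region in ("North Europe", "Italy North"):
        pair = paired_backup_region(region)
        status = f"DR supported, backup region: {pair}" if pair else "DR not supported"
        print(f"{region}: {status}")
```

Raising an error for unlisted regions is a deliberate choice in this sketch: a region absent from the excerpt should prompt a check against the full matrix rather than be reported as unsupported.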

How does disaster recovery work?

A nightly backup of the managed application is placed on Azure storage for replication. During recovery, this backup is loaded into a new deployment of Ansible Automation Platform in a non-impacted region. The amount of time required to recover an instance depends on the amount of data being recovered and the availability of Azure resources.

How does my application recover from an event?

The following steps should be taken if your managed application's region is experiencing a service-impacting event:

  1. Deploy a new instance of the managed application to a region of your choice. We recommend you use the region pair of your primary region. You must deploy the second instance of the managed application using the same Azure subscription as your primary instance.
    • Note: To ensure smooth data migration, do not set up any network configurations (such as VNet peering) until the data migration is successfully completed and verified. Once the SRE team confirms a successful recovery, you may proceed with network setup.
  2. Contact Red Hat customer support, stating that your managed application's region has failed and that your managed application needs to be recovered. Provide the following information:
    • Name of the instance impacted
    • Name of the new instance
    • Azure Subscription ID
    • Contact information for rapid collaboration
  3. Red Hat Site Reliability Engineers (SRE) will prioritize the recovery operation. The time required for a full recovery depends on the availability of Azure resources and the amount of data to recover.
  4. A Red Hat representative will contact you using the information supplied in your support request to indicate the process is complete. Any issues with the new instance will be prioritized and addressed promptly.
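
The information requested in step 2 can be gathered ahead of time so that nothing is missing when the support case is opened. The sketch below is illustrative only: the `RecoveryRequest` class, its field names, and the sample values are assumptions made for this example and do not correspond to any Red Hat API or case form.

```python
from dataclasses import dataclass, fields


@dataclass
class RecoveryRequest:
    """Details listed in step 2 above; field names are illustrative only."""

    impacted_instance_name: str
    new_instance_name: str
    azure_subscription_id: str  # same subscription for both instances
    contact_info: str

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are empty or only whitespace."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def summary(self) -> str:
        """Format the details as plain text to paste into a support case."""
        return "\n".join(
            f"{f.name.replace('_', ' ')}: {getattr(self, f.name)}" for f in fields(self)
        )


if __name__ == "__main__":
    # Placeholder values; replace with your own instance details.
    request = RecoveryRequest(
        impacted_instance_name="aap-primary",
        new_instance_name="aap-recovery",
        azure_subscription_id="00000000-0000-0000-0000-000000000000",
        contact_info="",
    )
    missing = request.missing_fields()
    if missing:
        print("Missing before contacting support:", ", ".join(missing))
    else:
        print(request.summary())
```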

The following estimates can help set expectations if a disaster recovery event occurs.

Task Description | Who? | Estimated Time
Contact Red Hat customer support and raise a Sev 1 case if the event is happening during a product outage of the original region. | Customer | See Premium Support SLAs
*Deploy a new instance of the managed application. | Customer | ~1.5 hours
Red Hat Site Reliability Engineers will initiate the recovery operation. | SRE | ~2 hours
**A Red Hat representative will contact you using the information supplied in your support request to indicate the process is complete. | Red Hat Support | When the recovery operation is complete within premium support SLA

*Customers will need to perform “post deployment network configuration” steps on the new environment, such as VNet peering and routing rule definitions. Expect this to take about as long as it did during the initial implementation of Ansible Automation Platform on Microsoft Azure. Refer to the documentation on customer responsibilities for Ansible Automation Platform on Microsoft Azure for more information.

**Disaster recovery estimates differ based on the data volume and the network and traffic configuration of each customer's environment, and depend on the following variables:

  • The database size of the site being recovered (including job history, inventory, and other AAP data).
  • The number of collections stored in the Private Automation Hub.
  • The number of execution environments (EEs) stored in Private Automation Hub.
  • Whether network routing between regions must be reconfigured, depending on how traffic is redirected.
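
As a rough worked example, the published estimates can be added to the parts that vary by customer. This sketch assumes the tasks run one after another, which may overstate the total if some overlap. The support response time and network configuration time passed in below are hypothetical inputs, not figures from this article.

```python
# Estimates taken from the table above, in hours.
DEPLOY_NEW_INSTANCE_HOURS = 1.5  # customer deploys a new instance
SRE_RECOVERY_HOURS = 2.0         # SRE recovery operation


def estimated_recovery_hours(
    support_response_hours: float, network_config_hours: float
) -> float:
    """Sum the sequential recovery tasks into one rough estimate.

    Both arguments are customer-specific assumptions: the first depends on
    the applicable support SLA, the second on how long post-deployment
    network configuration took during the initial implementation.
    """
    if support_response_hours < 0 or network_config_hours < 0:
        raise ValueError("durations must be non-negative")
    return (
        support_response_hours
        + DEPLOY_NEW_INSTANCE_HOURS
        + SRE_RECOVERY_HOURS
        + network_config_hours
    )


if __name__ == "__main__":
    # Hypothetical inputs: 1 hour for support response, 2 hours for networking.
    total = estimated_recovery_hours(
        support_response_hours=1.0, network_config_hours=2.0
    )
    print(f"Rough end-to-end estimate: {total:.1f} hours")
```

The actual duration also depends on the data volume variables listed above and on the availability of Azure resources, so treat the result as a planning figure only.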

How can disaster recovery be tested?

A disaster recovery test can be scheduled by submitting a support request. Tests are limited to one every six months.

