Disaster Recovery for Ansible Automation Platform on Azure
Ansible Automation Platform (AAP) on Microsoft Azure provides optional additional backup capabilities through a multi-region model for supported Azure regions. This feature is enabled on the “Business Continuity” step during deployment of the managed application. When enabled, AAP data is backed up from the primary Azure region to a secondary Azure region, which incurs additional Azure infrastructure costs for storage in the secondary region. If the option was not selected during deployment, customers can open a support request to have it enabled on their instance.
The disaster recovery feature activates the replication of storage between a primary region and its assigned paired region. AAP on Azure uses Microsoft-defined regional pairs to implement this solution. A list of Azure data center pairs can be found here: Azure Cross-Region Replication (https://learn.microsoft.com/en-us/azure/reliability/cross-region-replication-azure).
Disaster recovery is not the same as high availability. A loss of service and data can still occur when the primary region is impacted.
What is the regional support for disaster recovery?
The disaster recovery capability is not supported in all Azure regions. Before deployment, customers should consult the regional support matrix to verify that their desired region is supported. Nightly backups in the primary region are standard for all instances; replicating them to a secondary region further reduces risk if a catastrophic event affects the primary Azure region. We are continuously working with Microsoft to expand this capability as they add support for more regions.
Primary Region | Multi-Region Disaster Recovery (Y/N) | Paired Backup Region |
---|---|---|
Australia East | Y | Australia Southeast |
Australia Southeast | Y | Australia East |
Brazil South | Y | South Central US |
Canada Central | Y | Canada East |
Canada East | Y | Canada Central |
Central India | Y | South India |
Central US | Y | East US 2 |
Chile Central | N | n/a |
East Asia | Y | Southeast Asia |
East US | Y | West US |
East US 2 | Y | Central US |
France Central | Y | France South |
Germany West Central | Y | Germany North |
Indonesia Central | N | n/a |
Israel Central | N | n/a |
Italy North | N | n/a |
Japan East | Y | Japan West |
Japan West | Y | Japan East |
Korea South | Y | Korea Central |
Korea Central | Y | Korea South |
Malaysia West | N | n/a |
Mexico Central | N | n/a |
New Zealand North | N | n/a |
North Central US | Y | South Central US |
North Europe | Y | West Europe |
Norway East | Y | Norway West |
Poland Central | N | n/a |
Qatar Central | N | n/a |
South Africa North | Y | South Africa West |
South Central US | Y | North Central US |
South India | Y | Central India |
Southeast Asia | Y | East Asia |
Spain Central | N | n/a |
Sweden Central | Y | Sweden South |
Switzerland North | Y | Switzerland West |
UAE North | Y | UAE Central |
UK South | Y | UK West |
UK West | Y | UK South |
West Central US | Y | West US 2 |
West Europe | Y | North Europe |
West US | Y | East US |
West US 2 | Y | West Central US |
West US 3 | Y | East US |
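The matrix above can be treated as a simple lookup when planning a deployment. The following Python sketch encodes a few rows of the table for illustration. The `REGION_PAIRS` dictionary and the `paired_backup_region` function are hypothetical names, not part of any Red Hat or Azure API, and the remaining rows would need to be copied in from the table.

```python
from typing import Optional

# Subset of the regional support matrix above, keyed by primary region.
# A value of None means multi-region disaster recovery is not supported (N).
# Illustrative only: the remaining rows must be added from the table.
REGION_PAIRS: dict[str, Optional[str]] = {
    "Australia East": "Australia Southeast",
    "Brazil South": "South Central US",
    "Central US": "East US 2",
    "Chile Central": None,
    "East US": "West US",
    "North Europe": "West Europe",
    "Poland Central": None,
    "UK South": "UK West",
    "West US 3": "East US",
}


def paired_backup_region(primary: str) -> Optional[str]:
    """Return the paired backup region, or None if DR is unsupported.

    Raises KeyError for regions that were not copied into REGION_PAIRS.
    """
    return REGION_PAIRS[primary]


for region in ("Central US", "Chile Central"):
    pair = paired_backup_region(region)
    status = f"backs up to {pair}" if pair else "multi-region DR not supported"
    print(f"{region}: {status}")
```

Returning `None` for unsupported regions, rather than raising, keeps "not supported" distinct from "not in this subset of the table".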
How does disaster recovery work?
A nightly backup of the managed application is placed on Azure storage, which is replicated to the paired region. During recovery, this backup is loaded into a new deployment of Ansible Automation Platform in a non-impacted region. The time required to recover an instance depends on the amount of data being recovered and the availability of Azure resources.
How does my application recover from an event?
The following steps should be taken if your managed application's region is experiencing a service-impacting event:
- Deploy a new instance of the managed application to a region of your choice. We recommend you use the region pair of your primary region. You must deploy the second instance of the managed application using the same Azure subscription as your primary instance.
- Note: To ensure smooth data migration, do not set up any network configurations (such as VNet peering) until the data migration is successfully completed and verified. Once the SRE team confirms a successful recovery, you may proceed with network setup.
- Contact Red Hat customer support indicating your managed application's region has failed and your managed application needs to be recovered. Provide the following information:
  - Name of the instance impacted
  - Name of the new instance
  - Azure Subscription ID
  - Contact information for rapid collaboration
- Red Hat Site Reliability Engineers (SRE) will prioritize the recovery operation. The time required for a full recovery depends on the availability of Azure resources and the amount of data to recover.
- A Red Hat representative will contact you using the information supplied in your support request to indicate the process is complete. Any issues with the new instance will be prioritized and addressed promptly.
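The information requested in the steps above can be gathered ahead of time so the support case is complete when it is opened. The following Python sketch is a hypothetical checklist helper, not a Red Hat tool or API. The class name, field names, and sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class RecoveryRequest:
    """Details the recovery steps ask you to provide to Red Hat support.

    Hypothetical structure for illustration; field names are assumptions.
    """
    impacted_instance_name: str
    new_instance_name: str
    azure_subscription_id: str
    contact_information: str


def missing_fields(request: RecoveryRequest) -> list[str]:
    """Return the names of fields that are empty or whitespace only."""
    return [f.name for f in fields(request)
            if not getattr(request, f.name).strip()]


# Example with placeholder values and the subscription ID left blank.
request = RecoveryRequest(
    impacted_instance_name="aap-prod",
    new_instance_name="aap-prod-dr",
    azure_subscription_id="",
    contact_information="ops@example.com",
)
print(missing_fields(request))  # ['azure_subscription_id']
```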
The following estimates can help set expectations if a disaster recovery event occurs.
Task Description | Who? | Estimated Time |
---|---|---|
Contact Red Hat customer support and raise a Severity 1 case if the event occurs during an outage of the original region. | Customer | See Premium Support SLAs |
*Deploy a new instance of the managed application. | Customer | ~1.5 hours |
Red Hat Site Reliability Engineers will initiate the recovery operation. | SRE | ~2 hours |
**A Red Hat representative will contact you using the information supplied in your support request to indicate the process is complete. | Red Hat Support | When the recovery operation is complete, within the Premium Support SLA |
*Customers will need to perform “post deployment network configuration” steps on the new environment, such as VNet peering and routing rule definitions. Expect this to take about as long as it did during the initial implementation of Ansible Automation Platform on Microsoft Azure. Refer to the documentation on customer responsibilities for Ansible Automation Platform on Microsoft Azure for more information.
**Disaster recovery estimates differ with the data volume and the network and traffic configuration of each customer's environment. They depend on the following variables:
- The database size of the site being recovered (including job history, inventory, and other AAP data).
- The number of collections stored in the Private Automation Hub.
- The number of execution environments (EEs) stored in Private Automation Hub.
- Whether network routing between regions must be reconfigured, which depends on how traffic is redirected.
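As a rough worked example, the two fixed estimates in the table can be summed to give a baseline, with the variable parts supplied by the customer. This Python sketch assumes the tasks run one after another, which the source does not state. The function and parameter names are hypothetical, and the support response and network configuration times are inputs you would estimate yourself.

```python
# Estimates copied from the table above, in hours.
DEPLOY_NEW_INSTANCE_HOURS = 1.5
SRE_RECOVERY_HOURS = 2.0


def estimated_recovery_hours(support_response_hours: float = 0.0,
                             network_config_hours: float = 0.0) -> float:
    """Sum the table estimates with customer-supplied variable times.

    Assumes the tasks are sequential. Both arguments are assumptions the
    caller provides: support response time depends on the applicable SLA,
    and network configuration time depends on the customer's environment.
    """
    return (support_response_hours
            + DEPLOY_NEW_INSTANCE_HOURS
            + SRE_RECOVERY_HOURS
            + network_config_hours)


print(estimated_recovery_hours())          # 3.5
print(estimated_recovery_hours(1.0, 2.0))  # 6.5
```

The 3.5-hour baseline covers only deployment and the SRE recovery operation. Actual recovery time also varies with the data volume factors listed above.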
How can disaster recovery be tested?
A disaster recovery test can be scheduled by submitting a support request. Tests are limited to one every six months.
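To plan around the six-month limit, the earliest date for the next test can be computed from the date of the previous one. This Python sketch assumes the limit is counted as six calendar months from the last test, which is an interpretation and not something the source confirms. The function names are hypothetical.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month,
    so 31 August plus six months gives the last day of February.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def earliest_next_test(last_test: date) -> date:
    """Earliest next DR test, assuming a rolling six-calendar-month limit."""
    return add_months(last_test, 6)


print(earliest_next_test(date(2024, 1, 15)))  # 2024-07-15
print(earliest_next_test(date(2024, 8, 31)))  # 2025-02-28
```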