8.7. Failover and Resilience
Failover means setting up multiple units, and configuring them so that if one unit fails, another one will take over and continue the service without interruption.
Resilience ensures that when the network connection to a unit is interrupted and then reconnected, the service is not interrupted within a reasonable timeframe.
Some Hardware Security Module (HSM) models offer failover or resilience of varying degrees. For detail on the exact make and models and the features that they offer, consult your HSM manual, or contact the manufacturer. The HSMs described in the following sections have been tested with Red Hat Certificate System.
8.7.1. nCipher nShield HSM
With nShield Connect 6000, failover has been tested in the scenario where there are two HSM modules, nShield1, and nShield2, both running and configured for failover.
If one of nShield units goes down, the other exhibits ability to continue the provision of cryptographic services to Certificate System with no known issues, without restarting of the RHCS instance.
When the above situation happens (one HSM unit goes down), the administrator is expected to schedule a downtime for all the connected Certificate System instances and fix the down hsm unit and bring it back up and restart the instances. This means that if one unit goes down, Certificate System is expected to continue functioning; however, if the down hsm is brought back up without restarting the instances, the newly brought up HSM unit is not expected to be part of the failover scheme as originally planned.
With nShield Connect 6000, testing has shown that when the network cable is pulled off the HSM unit, and replugged in within up to 90 minutes, the service continues. There is no data for any time period longer than 90 minutes.
8.7.2. Gemalto Safenet LunaSA HSM
The Gemalto Safenet LunaSA Cloning model offers Failover. However, we have no data on this model.