9.3.3. Total Cluster Failure

The cluster guarantees availability as long as at least one active primary broker or ready backup broker remains alive. If all brokers fail simultaneously, the cluster fails and non-persistent data is lost.
Brokers are in one of six states:
  1. standalone: not part of an HA cluster.
  2. joining: a newly started backup that has not yet joined the cluster.
  3. catch-up: the backup has connected to the primary and is downloading queues, messages and other state.
  4. ready: the backup is connected and actively replicating from the primary; it is ready to take over.
  5. recovering: newly promoted to primary, waiting for backups to catch up before serving clients. Only a single primary broker can be recovering at a time.
  6. active: serving clients. Only a single primary broker can be active at a time.
While there is an active primary broker, clients can get service. If the active primary fails, one of the "ready" backup brokers takes over, recovers, and becomes active. A backup can be promoted to primary only if it is in the "ready" state (the one exception is the first primary in a new cluster, where all brokers are still in the "joining" state).
Given a stable cluster of N brokers with one active primary and N-1 ready backups, the system can sustain N-1 failures in rapid succession: the sole surviving broker is promoted to active and continues to provide service.
However, at this point the system cannot sustain a failure of the surviving broker until at least one of the other brokers recovers, catches up, and becomes a ready backup. If the surviving broker fails before that happens, the cluster fails in one of two modes, depending on the exact timing of the failures.
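You can check which state each broker is in with the qpid-ha tool. A minimal sketch, assuming the three example brokers used below (20.0.10.33-35), the default AMQP port 5672, and that qpid-ha accepts a -b/--broker option naming the broker to contact:

    # Ask one broker for the status of every broker in the cluster;
    # in a healthy cluster, expect one "active" broker and N-1 "ready" backups.
    qpid-ha status --all -b 20.0.10.33:5672

    # Query a single broker for its own state only.
    qpid-ha status -b 20.0.10.34:5672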
1. The cluster hangs

All brokers are in the joining or catch-up state. rgmanager tries to promote a new primary but cannot find any candidates and so gives up. clustat will show that the qpidd services are running but the qpidd-primary service has stopped, something like this:

Table 9.3. clustat output when the cluster hangs

  Service Name                    Owner (Last)     State
  service:mrg33-qpidd-service     20.0.10.33       started
  service:mrg34-qpidd-service     20.0.10.34       started
  service:mrg35-qpidd-service     20.0.10.35       started
  service:qpidd-primary-service   (20.0.10.33)     stopped
Eventually all brokers become stuck in the "joining" state, as shown by qpid-ha status --all.
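To confirm this failure mode you can compare the rgmanager view with the brokers' own view. A short sketch, under the same assumptions as the earlier qpid-ha example:

    # rgmanager view: the qpidd services are "started" but qpidd-primary-service is "stopped".
    clustat

    # Broker view: every broker reports "joining" (or "catch-up"); none is active.
    qpid-ha status --all -b 20.0.10.33:5672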
At this point you need to restart the cluster in one of the following ways:
Restart the entire cluster

  • In luci:<your-cluster>:Nodes, click Reboot to restart the entire cluster.
  • or stop and restart the cluster with ccs --stopall; ccs --startall (see the sketch below).
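A command-line sketch of the ccs option, assuming 20.0.10.33 is a cluster node that ccs can reach (ccs normally needs a -h <host> argument and may prompt for the ricci password):

    # Stop the cluster software on all nodes, then start it again on all nodes.
    ccs -h 20.0.10.33 --stopall
    ccs -h 20.0.10.33 --startall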

Restart just the Qpid services

  • In luci:<your-cluster>:Service Groups:
    • select all the qpidd (not primary) services and click Restart.
    • select the qpidd-primary service and click Restart.
  • or stop the primary and qpidd services with clusvcadm, then restart them, primary last (see the sketch below).
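A sketch of the clusvcadm option, using the service names from the clustat output above (clusvcadm -d disables/stops a service, clusvcadm -e enables/starts it); one reasonable ordering is:

    # Stop the primary service and the qpidd services.
    clusvcadm -d qpidd-primary-service
    clusvcadm -d mrg33-qpidd-service
    clusvcadm -d mrg34-qpidd-service
    clusvcadm -d mrg35-qpidd-service

    # Start the qpidd services again, and the primary service last.
    clusvcadm -e mrg33-qpidd-service
    clusvcadm -e mrg34-qpidd-service
    clusvcadm -e mrg35-qpidd-service
    clusvcadm -e qpidd-primary-service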

2. The cluster reboots

A new primary is promoted and the cluster becomes functional again, but all non-persistent data from before the failure is lost.
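After the new primary is promoted you can verify that service has resumed, with the same assumptions as the earlier sketches:

    # qpidd-primary-service should be "started" again.
    clustat

    # Exactly one broker should report "active"; the others catch up and become "ready".
    qpid-ha status --all -b 20.0.10.33:5672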