Chapter 37. High Availability and Fail-over

High availability is defined as the ability of the system to continue functioning after the failure of one or more of its servers.
A part of high availability is fail-over: the ability of client connections to migrate from one server to another in the event of server failure, so that client applications can continue to operate.

Warning

HornetQ requires a stable, reliable connection to the file system where its journal is located. If connectivity between HornetQ and the journal is lost and later re-established, a messaging I/O error will occur. This error is considered a "major event" and requires manual intervention with the messaging system in order to recover (that is, the messaging system will need to be restarted). If this occurs on a cluster node, the other nodes will take on the load of the failed node, provided they have been configured to do so.

37.1. Live - Backup Pairs

HornetQ allows pairs of servers to be linked together as live - backup pairs. In this release there is a single backup server for each live server. A backup server is owned by only one live server. Backup servers are not operational until fail-over occurs.
Before fail-over, only the live server serves the HornetQ clients; the backup server remains passive. When a live server crashes or is shut down normally, the backup server currently in passive mode becomes live. If a live server restarts after a fail-over, it has priority and will be the next server to become live when the current live server goes down; if the current live server is configured to allow automatic fail-back, it will detect the original live server coming back up and automatically stop.

37.1.1. HA modes

HornetQ provides only the shared store high availability mode in this release.

Note

Only persistent message data will survive fail-over. Non-persistent message data is lost after fail-over occurs.

37.1.2. Shared Store

When using a shared store, the live and backup servers share the same entire data directory via a shared file system: the paging directory, journal directory, large messages directory, and bindings journal.
When fail-over occurs and the backup server takes over, it loads the persistent storage from the shared file system, and clients can then connect to it.
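
For example, both nodes might mount the same SAN file system and point every data directory at it. In a minimal sketch (the mount point /mnt/shared is a hypothetical path; adjust it to your environment), the relevant entries in each server's hornetq-configuration.xml would be:

<!-- Sketch: all four data directories live on the shared mount so that
     the backup sees exactly the same state as the live server.
     /mnt/shared is an illustrative path, not a required location. -->
<paging-directory>/mnt/shared/paging</paging-directory>
<bindings-directory>/mnt/shared/bindings</bindings-directory>
<journal-directory>/mnt/shared/journal</journal-directory>
<large-messages-directory>/mnt/shared/large-messages</large-messages-directory>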

Important

HornetQ HA supports shared store on GFS2 on SAN.
This style of high availability differs from data replication in that it requires a shared file system which is accessible by both the live and backup nodes. Typically this will be some kind of high performance Storage Area Network (SAN). Do not use NFS mounts to store the shared journal when using NIO (non-blocking I/O). NFS is in any case not ideal for this purpose because of its limited data transfer rate.
The advantage of shared-store high availability is that no replication occurs between the live and backup nodes; it therefore suffers no performance penalty from replication overhead during normal operation.
The disadvantage of shared-store high availability is that it requires a shared file system, and that when the backup server activates it must load the journal from the shared store, which can take some time depending on the amount of data in the store.
If the highest performance during normal operation is required, a fast SAN is available, and a slightly slower fail-over (depending on the amount of data) is acceptable, shared-store high availability is recommended.

37.1.2.1. Configuration

To configure the live and backup servers to share their store, set the following in the <JBOSS_HOME>/jboss-as/server/<PROFILE>/deploy/hornetq/hornetq-configuration.xml file on each node:
<shared-store>true</shared-store>
Additionally, the backup server must be flagged explicitly as a backup:
<backup>true</backup>
In order for live - backup pairs to operate properly with a shared store, both servers must have the journal directory configured to point to the same shared location (as explained in Section 13.3, “Configuring the message journal”).
The live - backup pair must have a cluster connection defined, even if the pair is not part of a cluster. The cluster connection defines how backup servers announce their presence to the live server and any other nodes in the cluster. Refer to Chapter 36, Clusters for details on how to configure this.
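
As an illustrative sketch (the names my-cluster, netty-connector and my-discovery-group are placeholders and must match a connector and discovery group defined elsewhere in hornetq-configuration.xml; Chapter 36 covers the full set of options), a minimal cluster connection might look like this:

<!-- Minimal sketch of a cluster connection for a live - backup pair.
     All names are illustrative. -->
<cluster-connections>
   <cluster-connection name="my-cluster">
      <address>jms</address>
      <connector-ref>netty-connector</connector-ref>
      <discovery-group-ref discovery-group-name="my-discovery-group"/>
   </cluster-connection>
</cluster-connections>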

37.1.2.2. Failing Back to Live Server

After a live server has failed and a backup has taken over its duties, you may want to restart the live server and have clients fail back. To do this, restart the original live server and stop the new live server, either by terminating the process itself or by shutting the server down normally.
It is also possible to cause fail-over to occur on normal server shutdown. To enable this, set the following property to true in <JBOSS_HOME>/jboss-as/server/<PROFILE>/deploy/hornetq/hornetq-configuration.xml:
<failover-on-shutdown>true</failover-on-shutdown>
You can force the new live server to shut down when the old live server comes back up, allowing the original live server to take over automatically, by setting the following property in <JBOSS_HOME>/jboss-as/server/<PROFILE>/deploy/hornetq/hornetq-configuration.xml:
<allow-failback>true</allow-failback>
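
Putting these settings together, a sketch of the relevant fragments of each server's hornetq-configuration.xml for shared-store high availability with automatic fail-back might look like this:

<!-- Live server (sketch): fails over to the backup even on clean shutdown -->
<shared-store>true</shared-store>
<failover-on-shutdown>true</failover-on-shutdown>

<!-- Backup server (sketch): stops automatically when the original live
     server comes back up, allowing it to take over again -->
<backup>true</backup>
<shared-store>true</shared-store>
<allow-failback>true</allow-failback>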