Chapter 23. Participant Crash Recovery

A key requirement of a transaction service is to be resilient to a system crash by a host running a participant, as well as the host running the transaction coordination services. Crashes which happen before a transaction terminates or before a business activity completes are relatively easy to accommodate. The transaction service and participants can adopt a presumed abort policy.

Procedure 23.1. Presumed Abort Policy

  1. If the coordinator crashes, it can assume that any transaction it does not know about is invalid, and reject a participant request which refers to such a transaction.
  2. If the participant crashes, it can forget any provisional changes it has made, and reject any request from the coordinator service to prepare a transaction or complete a business activity.
Crash recovery is more complex if the crash happens during a transaction commit operation, or between completing and closing a business activity. The transaction service must ensure as far as possible that participants arrive at a consistent outcome for the transaction.
WS-AT Transaction
The transaction needs to commit all provisional changes or roll them all back to the state before the transaction started.
WS-Business Activity Transaction
All participants need to close the activity or cancel the activity, and run any required compensating actions.
On the rare occasions where such a consensus cannot be reached, the transaction service must log and report transaction failures.
XTS includes support for automatic recovery of WS-AT and WS-BA transactions, if either or both of the coordinator and participant hosts crashes. The XTS recovery manager begins execution on coordinator and participant hosts when the XTS service restarts. On a coordinator host, the recovery manager detects any WS-AT transactions which have prepared but not committed, as well as any WS-BA transactions which have completed but not yet closed. It ensures that all their participants are rolled forward in the first case, or closed in the second.
On a participant host, the recovery manager detects any prepared WS-AT participants which have not responded to a transaction rollback, and any completed WS-BA participants which have not yet responded to an activity cancel request, and ensures that the former are rolled back and the latter are compensated. The recovery service also allows for recovery of subordinate WS-AT transactions and their participants if a crash occurs on a host where an interposed WS-AT coordinator has been employed.

23.1. WS-AT Recovery

23.1.1. WS-AT Coordinator Crash Recovery

The WS-AT coordination service tracks the status of each participant in a transaction as the transaction progresses through its two-phase commit. When all participants have been sent a prepare message and have responded with a prepared message, the coordinator writes a log record storing each participant's details, indicating that the transaction is ready to complete. If the coordinator service crashes after this point has been reached, completion of the two-phase commit protocol is still guaranteed, by reading the log file after reboot and sending a commit message to each participant. Once all participants have responded to the commit with a committed message, the coordinator can safely delete the log entry.
Since the prepared messages returned by the participants imply that they are ready to commit their provisional changes and make them permanent, this type of recovery is safe. Additionally, the coordinator does not need to account for any commit messages which may have been sent before the crash, or resend messages if it crashes several times. The XTS participant implementation is resilient to redelivery of the commit messages. If the participant has implemented the recovery functions described in Section 23.1.2.1, “WS-AT Participant Crash Recovery APIs”, the coordinator can guarantee delivery of commit messages if both it crashes, and one or more of the participant service hosts also crash, at the same time.
If the coordination service crashes before the prepare phase completes, the presumed abort protocol ensures that participants are rolled back. After system restart, the coordination service has the information about all the transactions which could have entered the commit phase before the reboot, since they have entries in the log. It also knows about any active transactions started after the reboot. If a participant is waiting for a response, after sending its prepared message, it automatically re sends the prepared message at regular intervals. When the coordinator detects a transaction which is not active and has no entry in the log file after the reboot, it instructs the participant to abort, ensuring that the web service gets a chance to roll back any provisional state changes it made on behalf of the transaction.
A web service may decide to unilaterally commit or roll back provisional changes associated with a given participant, if configured to time out after a specified length of time without a response. In this situation, the web service should record this action and log a message to persistent storage. When the participant receives a request to commit or roll back, it should throw an exception if its unilateral decision action does not match the requested action. The coordinator detects the exception and logs a message marking the outcome as heuristic. It also saves the state of the transaction permanently in the transaction log, to be inspected and reconciled by an administrator.

23.1.2. WS-AT Participant Crash Recovery

WS-AT participants associated with a transactional web service do not need to be involved in crash recovery if the Web service's host machine crashes before the participant is told to prepare. The coordinator will assume that the transaction has aborted, and the Web service can discard any information associated with unprepared transactions when it reboots.
When a participant is told to prepare, the Web service is expected to save to persistent storage the transactional state it needs to commit or roll back the transaction. The specific information it needs to save is dependent on the implementation and business logic of the Web Service. However, the participant must save this state before returning a Prepared vote from the prepare call. If the participant cannot save the required state, or there is some other problem servicing the request made by the client, it must return an Aborted vote.
The XTS participant services running on a Web Service's host machine cooperate with the Web service implementation to facilitate participant crash recovery. These participant services are responsible for calling the participant's prepare, commit, and rollback methods. The XTS implementation tracks the local state of every enlisted participant. If the prepare call returns a Prepared vote, the XTS implementation ensures that the participant state is logged to the local transaction log before forwarding a prepared message to the coordinator.
A participant log record contains information identifying the participant, its transaction, and its coordinator. This is enough information to allow the rebooted XTS implementation to reinstate the participant as active and to continue communication with the coordinator, as though the participant had been enlisted and driven to the prepared state. However, a participant instance is still necessary for the commit or rollback process to continue.
Full recovery requires the log record to contain information needed by the Web service which enlisted the participant. This information must allow it to recreate an equivalent participant instance, which can continue the commit process to completion, or roll it back if some other Web Service fails to prepare. This information might be as simple as a String key which the participant can use to locate the data it made persistent before returning its Prepared vote. It may be as complex as a serialized object tree containing the original participant instance and other objects created by the Web service.
If a participant instance implements the relevant interface, the XTS implementation will append this participant recovery state to its log record before writing it to persistent storage. In the event of a crash, the participant recovery state is retrieved from the log and passed to the Web Service which created it. The Web Service uses this state to create a new participant, which the XTS implementation uses to drive the transaction to completion. Log records are only deleted after the participant's commit or rollback method is called.

Warning

If a crash happens just before or just after a commit method is called, a commit or rollback method may be called twice.

23.1.2.1. WS-AT Participant Crash Recovery APIs

23.1.2.1.1. Saving Participant Recovery State
To signal that it is capable of performing recovery processing, a participant can implement the java.lang.Serializable interface. Alternatively it may implement Example 23.1, “The PersistableATParticipant Interface”.

Example 23.1. The PersistableATParticipant Interface

	      public interface PersistableATParticipant
	      {
	        byte[] getRecoveryState() throws Exception;
	      }
If a participant implements the Serializable interface, the XTS participant services implementation uses the serialization API to create a version of the participant which can be appended to the participant log entry. If it implements the PersistableATParticipant interface, the XTS participant services implementation call the getRecoveryState method to obtain the state to be appended to the participant log entry.
If neither of these APIs is implemented, the XTS implementation logs a warning message and proceeds without saving any recovery state. In the event of a crash on the host machine for the Web service during commit, the transaction cannot be recovered and a heuristic outcome may occur. This outcome is logged on the host running the coordinator services.
23.1.2.1.2. Recovering Participants at Reboot
A Web service must register with the XTS implementation when it is deployed, and unregister when it is undeployed, in order to participate in recovery processing. Registration is performed using class XTSATRecoveryManager defined in package org.jboss.jbossts.xts.recovery.participant.at.

Example 23.2. Registering for Recovery

public abstract class XTSATRecoveryManager {
	. . .
 	public static XTSATRecoveryManager getRecoveryManager() ;
 	public void registerRecoveryModule(XTSATRecoveryModule module);
	public abstract void unregisterRecoveryModule(XTSATRecoveryModule module)
		throws NoSuchElementException;
	. . .
}
The Web service must provide an implementation of the XTSATRecoveryModule, located in the org.jboss.jbossts.xts.recovery.participant.at, as argument to both the register and unregister calls. This instance is responsible for identifying saved participant recovery records and recreating new, recovered participant instances.

Example 23.3. XTSATRecoveryModule Implementation

public interface XTSATRecoveryModule
{
    public Durable2PCParticipant
	deserialize(String id, ObjectInputStream stream) 
	  throws Exception;
    public Durable2PCParticipant
	recreate(String id, byte[] recoveryState) 
	  throws Exception;
}
If a participant's recovery state was saved using serialization, the recovery module's deserialize method is called to recreate the participant. Normally, the recovery module is required to read, cast, and return an object from the supplied input stream. If a participant's recovery state was saved using the PersistableATParticipant interface, the recovery module's recreate method is called to recreate the participant from the byte array it provided when the state was saved.
The XTS implementation cannot identify which participants belong to which recovery modules. A module only needs to return a participant instance if the recovery state belongs to the module's Web service. If the participant was created by another Web service, the module should return null. The participant identifier, which is supplied as argument to the deserialize or recreate method, is the identifier used by the Web service when the original participant was enlisted in the transaction. Web Services participating in recovery processing should ensure that participant identifiers are unique per service. If a module recognizes that a participant identifier belongs to its Web service, but cannot recreate the participant, it should throw an exception. This situation might arise if the service cannot associate the participant with any transactional information which is specific to the business logic.
Even if a module relies on serialization to create the participant recovery state saved by the XTS implementation, it still must be registered by the application. The deserialization operation must employ a class loader capable of loading classes specific to the Web service. XTS fulfills this requirement by devolving responsibility for the deserialize operation to the recovery module.