23.2. WS-BA Recovery

23.2.1. WS-BA Coordinator Crash Recovery

The WS-BA coordination service implementation tracks the status of each participant in an activity as the activity progresses through completion and closure. A transition point occurs during closure, once all CoordinatorCompletion participants have received a complete message and responded with a completed message. At this point, all ParticipantCompletion participants should also have sent a completed message. The coordinator writes a log record that stores the details of each participant and indicates that the activity is ready to close. If the coordinator service crashes after the log record is written, the close operation is still guaranteed to succeed. The coordinator checks the log after the system reboots and resends a close message to all participants. After all participants respond to the close with a closed message, the coordinator can safely delete the log entry.
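The log-then-close sequence described above can be illustrated with a minimal sketch. It is not the XTS implementation: the class CloseLogSketch, its method names, and the in-memory map standing in for durable storage are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the coordinator's close log; not the XTS implementation. */
final class CloseLogSketch {
    /** Stand-in for durable storage: activity id -> participants still awaiting "closed". */
    private final Map<String, Set<String>> log = new LinkedHashMap<>();

    /** Records every close message "sent", so the example can be inspected. */
    final List<String> sentCloses = new ArrayList<>();

    /** Transition point: all participants have completed, so write the record before closing. */
    void readyToClose(String activityId, List<String> participantIds) {
        log.put(activityId, new LinkedHashSet<>(participantIds));
        sendCloses(activityId);
    }

    /** After a restart, resend close to every participant still listed in the log. */
    void recover() {
        for (String activityId : new ArrayList<>(log.keySet())) {
            sendCloses(activityId);
        }
    }

    /** A closed response removes the participant; the record is deleted once none remain. */
    void onClosed(String activityId, String participantId) {
        Set<String> pending = log.get(activityId);
        if (pending == null) {
            return; // duplicate or late response for an already deleted record
        }
        pending.remove(participantId);
        if (pending.isEmpty()) {
            log.remove(activityId);
        }
    }

    boolean hasRecord(String activityId) {
        return log.containsKey(activityId);
    }

    private void sendCloses(String activityId) {
        for (String participantId : log.get(activityId)) {
            sentCloses.add(activityId + ":" + participantId);
        }
    }
}
```

In this sketch the record is written before any close message goes out, so a restart at any later point can replay the close from the log alone. A real coordinator would persist the record to stable storage rather than to a map.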
The coordinator does not need to account for any close messages sent before the crash, nor for messages resent during earlier recovery attempts if it crashes several times. The XTS participant implementation is resilient to redelivery of close messages. Assuming that the participant has implemented the recovery functions described below, the coordinator can guarantee delivery of close messages even if the coordinator and one or more of the participant service hosts crash simultaneously.
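Resilience to redelivery amounts to handling close idempotently. The following minimal sketch shows one way to get that property; the class IdempotentCloseSketch and its members are hypothetical and do not reflect how XTS implements it.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical participant-side handler that tolerates redelivered close messages. */
final class IdempotentCloseSketch {
    /** Participants whose close work has already been performed. */
    private final Set<String> closed = new HashSet<>();

    /** Counts how often the real close work ran, for inspection only. */
    int closeWorkRuns = 0;

    /** Always answers "closed", but performs the close work only on first delivery. */
    String onClose(String participantId) {
        if (closed.add(participantId)) {
            closeWorkRuns++; // the participant's actual close logic would run here
        }
        return "closed";
    }
}
```

Because a repeated close produces the same closed response without repeating the work, the coordinator can resend freely after each restart.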
If the coordination service crashes before it has written the log record, it does not need to explicitly compensate any completed participants. The presumed abort protocol ensures that all completed participants are eventually sent a compensate message. Recovery must be initiated from the participant side.
A log record does not need to be written when an activity is being canceled. If a participant does not respond to a cancel or compensate request, the coordinator logs a warning and continues. The combination of the presumed abort protocol and participant-led recovery ensures that all participants eventually get canceled or compensated, as appropriate, even if the participant host crashes.
If a completed participant does not detect a response from its coordinator after resending its completed response a suitable number of times, it switches to sending getstatus messages, to determine whether the coordinator still knows about it. If a crash occurs before writing the log record, the coordinator has no record of the participant when the coordinator restarts, and the getstatus request returns a fault. The participant recovery manager automatically compensates the participant in this situation, just as if the activity had been canceled by the client.
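The fallback from resending completed to sending getstatus can be sketched as a small state holder. This is an illustration under stated assumptions: the class GetStatusFallbackSketch, the enum names, and the maxResends threshold (standing in for "a suitable number of times") are hypothetical and are not taken from XTS.

```java
/** Hypothetical model of a completed participant waiting for its coordinator. */
final class GetStatusFallbackSketch {
    /** Message to send when a retry timer fires without a coordinator response. */
    enum Step { RESEND_COMPLETED, SEND_GETSTATUS }

    /** Simplified getstatus results: the coordinator knows the participant, or returns a fault. */
    enum StatusReply { KNOWN, FAULT }

    /** What happens to the participant after the getstatus reply. */
    enum Outcome { KEEP_WAITING, COMPENSATE }

    private final int maxResends; // assumed threshold; the source gives no number
    private int resends = 0;

    GetStatusFallbackSketch(int maxResends) {
        this.maxResends = maxResends;
    }

    /** Resend completed until the threshold is reached, then switch to getstatus. */
    Step onTimeout() {
        if (resends < maxResends) {
            resends++;
            return Step.RESEND_COMPLETED;
        }
        return Step.SEND_GETSTATUS;
    }

    /** A fault means the coordinator has no record, so the participant is compensated. */
    Outcome onStatus(StatusReply reply) {
        if (reply == StatusReply.FAULT) {
            return Outcome.COMPENSATE;
        }
        resends = 0; // coordinator still knows the participant: go back to resending completed
        return Outcome.KEEP_WAITING;
    }
}
```

With a threshold of 2, for example, the first two timeouts resend completed and the third sends getstatus; a fault reply then leads to compensation.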
After a participant crash, the participant recovery manager detects the log entries for each completed participant. It sends a getstatus message to each participant's coordinator host, to determine whether the activity still exists. If the coordinator has not crashed and the activity is still running, the participant switches back to resending completed messages, and waits for a close or compensate response. If the coordinator has also crashed or the activity has been canceled, the participant is automatically compensated.