9.8. Ways in Which You Can Minimize the Risk of Failures

Here are some ways in which you can minimize the risk of data loss through failures:
  • Try to develop stateless and idempotent services. If this is not possible, use MessageID to identify Messages so that your application can detect retransmission attempts. When retrying Message transmission, use the same MessageID. Services that are not idempotent, and that would suffer from redoing the same work if they received a retransmitted Message, should record state transitions against the MessageID, preferably using transactions. Applications based around stateless services also tend to scale better.
  • If developing stateful services, use transactions and a (preferably clustered) JMS implementation.
  • Cluster your Registry and use a clustered/fault-tolerant back-end database, to remove any single points of failure.
  • Ensure that the Message Store is backed by a highly available database.
  • Clearly identify which services, and which operations on services, need higher reliability and fault tolerance capabilities than others. This will allow you to target transports other than JMS at the services that do not need those capabilities, potentially improving the overall performance of applications. Because JBossESB allows services to be used through different EPRs concurrently, it is also possible to offer different qualities of service (QoS) to different consumers based on application-specific requirements.
  • Because network partitions can make services appear as though they have failed, avoid transports that are more prone to this type of failure for services that cannot cope with being misidentified as having crashed.
  • With some transports (for example, HTTP), the crash of a server after it has dealt with a message but before it has responded could result in another server doing the same work. This is because the sender cannot differentiate between a machine that fails after the service receives the message and processes it, and one that fails after the service receives the message but before it processes it.
  • Using asynchronous (one-way) delivery patterns makes it difficult to detect failures of services: there is typically no notion of a lost or delayed Message if responses to requests can come at arbitrary times. If there are no responses at all, failure detection becomes even more problematic and you may have to rely upon application semantics to determine that Messages did not arrive (for example, the amount of money in the bank account does not match expectations). When using either the ServiceInvoker or Couriers to deliver asynchronous Messages, a return from the respective operation (for example, deliverAsync) does not mean the Message has been acted upon by the service.
  • The message store is used by the redelivery protocol. However, as mentioned earlier, this is a best-effort protocol for improved robustness and does not use transactions or reliable message delivery. This means that certain failures may result in messages being lost entirely (they do not get written to the store before a crash) or delivered multiple times (the redelivery mechanism pulls a message from the store and delivers it successfully, but a crash prevents the message from being removed from the store, so upon recovery the message is delivered again).
  • Some transports, such as FTP, can be configured to retain messages that have been processed, although they will be uniquely marked to differentiate them from unprocessed messages. The default approach is often to delete messages once they have been processed, but you may want to change this default so that your applications can determine which messages have been dealt with upon recovery from failures.
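
The first recommendation above, recording work against the MessageID so that a retransmitted Message is not processed twice, can be sketched as follows. This is a minimal illustration only, not part of the JBossESB API: the class IdempotentProcessor and its methods are hypothetical names, the MessageID is assumed to be available as a String, and an in-memory map stands in for what would normally be a transactional, highly available database.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical helper (not a JBossESB class): remembers the result produced
 * for each MessageID so that a retransmitted Message does not repeat the work.
 */
final class IdempotentProcessor<R> {

    // Assumption: an in-memory map stands in for a transactional database.
    // State held here is lost if the process crashes.
    private final Map<String, R> resultsByMessageId = new ConcurrentHashMap<>();

    /**
     * Runs the work at most once per MessageID. A retransmission carrying
     * the same MessageID gets the recorded result back instead.
     * Note: if the work returns null, nothing is recorded for that ID.
     */
    R process(String messageId, String payload, Function<String, R> work) {
        // computeIfAbsent applies the function only when the key is absent.
        return resultsByMessageId.computeIfAbsent(messageId, id -> work.apply(payload));
    }

    /** Returns true if a result has already been recorded for this MessageID. */
    boolean hasSeen(String messageId) {
        return resultsByMessageId.containsKey(messageId);
    }
}
```

A sender that retries must reuse the original MessageID for this to work; a retry sent under a new MessageID would be treated as new work. In a real service, the recorded state and the side effects of the work would ideally be committed in the same transaction, so that a crash between the two cannot leave them inconsistent.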