worker marked as zombie and TX scheduled for mark-as-rollback

Solution Unverified - Updated -

Environment

  • Red Hat JBoss Enterprise Application Platform (EAP) 5

Issue

  • Repeated log messages reporting reaper worker threads marked as zombies

    WARN  [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.TransactionReaper_18] - TransactionReaper::check timeout for TX a0a0a6f:adfa:4fb88f10:1e9fe4f in state  RUN
    WARN  [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.BasicAction_58] - Abort of action id a0a0a6f:adfa:4fb88f10:1e9fe4f invoked while multiple threads active within it.
    WARN  [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.CheckedAction_2] - CheckedAction::check - atomic action a0a0a6f:adfa:4fb88f10:1e9fe4f aborting with 1 threads active!
    WARN  [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.TransactionReaper_18] - TransactionReaper::check timeout for TX a0a0a6f:adfa:4fb88f10:1e9fe4f in state  CANCEL
    WARN  [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.TransactionReaper_18] - TransactionReaper::check timeout for TX a0a0a6f:adfa:4fb88f10:1e9fe4f in state  CANCEL_INTERRUPTED
    WARN  [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.TransactionReaper_6] - TransactionReaper::check worker Thread[Thread-4021678,5,jboss] not responding to interrupt when cancelling TX a0a0a6f:adfa:4fb88f10:1e9fe4f -- worker marked as zombie and TX scheduled for mark-as-rollback
    WARN  [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.TransactionReaper_11] - TransactionReaper::check failed to mark TX a0a0a6f:adfa:4fb88f10:1e9fe4f  as rollback only
    ...
    ERROR [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.TransactionReaper_5] - TransactionReaper::check worker zombie count 8 exceeds specified limit
    ...
    WARN  [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.TransactionReaper_13] - TransactionReaper::doCancellations worker Thread[Thread-5189812,5,jboss] missed interrupt when cancelling TX a0a0a6f:adfa:4fb88f10:1e9feeb -- exiting as zombie (zombie count decremented to 1)
    WARN  [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.TransactionReaper_13] - TransactionReaper::doCancellations worker Thread[Thread-5189960,5,jboss] missed interrupt when cancelling TX a0a0a6f:adfa:4fb88f10:1ea05a7 -- exiting as zombie (zombie count decremented to 0)
    
  • What is causing transactions to become zombies?

  • Can zombie transactions be ignored?

Resolution

Seeing this warning once in a while is not a major concern, but if it occurs frequently the container will eventually either stop processing timeouts or run out of threads and bring down the JVM.

Unfortunately, there is not much that can be done to work around this limitation in the driver design. Some drivers expose a query timeout setting that limits how long a query may run on the database server. This is much the same as a transaction timeout but, because it is enforced on the server side rather than the driver side, it does not suffer from the network I/O scheduling problem.
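As a rough illustration of the query timeout idea, the sketch below uses the standard JDBC call Statement.setQueryTimeout. The class name, the helper names and the 5 second margin are hypothetical and not taken from any product documentation. Whether the limit is actually enforced on the database server rather than inside the driver depends on the driver, so treat this as a starting point and check the driver's documentation.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class QueryTimeoutSketch {

    // Hypothetical helper: pick a query timeout slightly shorter than the
    // transaction timeout so the query gives up before the reaper fires.
    static int queryTimeoutSeconds(int txTimeoutSeconds, int marginSeconds) {
        if (txTimeoutSeconds <= 0) {
            throw new IllegalArgumentException("transaction timeout must be positive");
        }
        // setQueryTimeout(0) means "no limit", so never return less than 1.
        return Math.max(1, txTimeoutSeconds - marginSeconds);
    }

    // Runs the query with a time limit and counts the rows it returns.
    static int countRows(Connection connection, String sql, int txTimeoutSeconds)
            throws SQLException {
        PreparedStatement statement = connection.prepareStatement(sql);
        try {
            // Assumed margin of 5 seconds; tune for the workload.
            statement.setQueryTimeout(queryTimeoutSeconds(txTimeoutSeconds, 5));
            ResultSet rows = statement.executeQuery();
            int count = 0;
            while (rows.next()) {
                count++;
            }
            return count;
        } finally {
            statement.close(); // closing the statement also closes its ResultSet
        }
    }
}
```

The try/finally form is used instead of try-with-resources so that the sketch also compiles on the older JDKs typically used with this platform.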

Keep an eye on the zombie count and ensure it does not grow too big. For example:

    ERROR [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.TransactionReaper_5] - TransactionReaper::check worker zombie count 8 exceeds specified limit
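One possible way to keep an eye on the zombie count is to scan the server log for the two message shapes shown in this article. The sketch below is an assumption-laden example: the class and method names are hypothetical, and the regular expressions are derived only from the sample log lines above, so they may need adjusting for other versions or message formats.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ZombieCountScanner {

    // Patterns derived from the sample messages in this article (assumption).
    private static final Pattern EXCEEDS =
            Pattern.compile("worker zombie count (\\d+) exceeds specified limit");
    private static final Pattern DECREMENTED =
            Pattern.compile("zombie count decremented to (\\d+)");

    // Returns the zombie count reported on the line, or -1 if it reports none.
    static int zombieCount(String line) {
        Matcher exceeds = EXCEEDS.matcher(line);
        if (exceeds.find()) {
            return Integer.parseInt(exceeds.group(1));
        }
        Matcher decremented = DECREMENTED.matcher(line);
        if (decremented.find()) {
            return Integer.parseInt(decremented.group(1));
        }
        return -1;
    }

    // Returns the highest zombie count seen in the given lines, or -1 if none.
    static int highestZombieCount(Iterable<String> lines) {
        int highest = -1;
        for (String line : lines) {
            highest = Math.max(highest, zombieCount(line));
        }
        return highest;
    }
}
```

Feeding the log lines from the Issue section through highestZombieCount would report 8, the value from the "exceeds specified limit" message.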

Root Cause

Transactions are subject to a timeout, intended to ensure that poorly written or under-performing code can't compromise throughput by holding data locks indefinitely. The reaper is a background process in the transaction manager that is responsible for rolling back (i.e. aborting) transaction instances that reach their timeout.

The most common scenario is a transaction that is blocked on network I/O to a database server due to a long-running query. Pretty much all database drivers use traditional blocking I/O, which is non-interruptible. For this reason, transaction timeouts are processed on background threads rather than by interrupting the business logic thread. The reaper hands off an expired transaction instance to a worker thread drawn from a pool, and that thread calls rollback on the transaction. This results in the enlisted XAResources being invoked to inform the resource managers (e.g. databases) of the rollback.

Here is where it gets hairy: those calls themselves typically involve blocking network I/O. Thus the reaper framework monitors the worker threads and tries to identify any that are stuck. These are termed zombies. There is essentially a fallback error handling pattern that tries to maintain system stability by removing stuck workers from the pool, retrying blocked steps and so on.

Root cause analysis is tricky at best, but in general the problem is attributable to XAResource implementations (e.g. from database drivers) that multiplex SQL and XA over the same network connection in a serial fashion. In such drivers the XAResource.rollback call issued via the reaper worker thread gets queued behind the long-running SQL statement that caused the timeout, rather than 'interrupting' it. Such situations are self-correcting eventually, as the rollback will be processed once the SQL statement completes. That's normally enough to keep the system stable, but it does not help preserve high throughput.
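The queuing behaviour described above can be imitated with a small stand-alone simulation. This is only a sketch under stated assumptions: a single-threaded executor stands in for a connection that processes requests strictly in order, a latch stands in for the long-running SQL statement, and all names are hypothetical. It uses no real driver or transaction manager code, and it uses lambdas, so it needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SerialConnectionSketch {

    static List<String> run() {
        // One thread stands in for a connection that serves requests in order.
        ExecutorService connection = Executors.newSingleThreadExecutor();
        List<String> events = Collections.synchronizedList(new ArrayList<String>());
        CountDownLatch queryMayFinish = new CountDownLatch(1);
        try {
            // The "long-running SQL statement" occupies the connection.
            connection.submit(() -> {
                try {
                    queryMayFinish.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                events.add("query finished");
            });
            // The "rollback" is queued behind it on the same connection.
            Future<?> rollback = connection.submit(() -> {
                events.add("rollback processed");
            });
            try {
                // What the reaper sees: the rollback does not return in time.
                rollback.get(100, TimeUnit.MILLISECONDS);
            } catch (TimeoutException stuck) {
                events.add("rollback still queued");
            }
            // Once the query completes, the queued rollback goes through.
            queryMayFinish.countDown();
            rollback.get(5, TimeUnit.SECONDS);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        } finally {
            connection.shutdown();
        }
        return new ArrayList<String>(events);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Running it prints [rollback still queued, query finished, rollback processed], which mirrors the self-correcting sequence: the rollback appears stuck until the statement ahead of it completes.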

Diagnostic Steps

  • Thread dump analysis is required to validate the hypothesis that the reaper's transaction rollback request was executed in a thread that is blocked by the same database query it is attempting to roll back. You would most likely see the reaper worker thread blocked in a third-party driver's XAResource code.
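Thread dumps are normally captured from outside the JVM, for example with jstack. For code that can run inside the same JVM, the following sketch shows one way to look for threads that appear to be inside a rollback call. The heuristic (any stack frame whose method is named rollback) and all class and method names are assumptions for illustration; a match only flags a thread for closer inspection and does not prove it is blocked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ReaperThreadFinder {

    // Heuristic (assumption): a frame named "rollback" suggests the thread
    // is somewhere inside an XAResource.rollback style call.
    static boolean looksStuckInRollback(StackTraceElement[] stack) {
        for (StackTraceElement frame : stack) {
            if ("rollback".equals(frame.getMethodName())) {
                return true;
            }
        }
        return false;
    }

    // Returns the names of threads whose stacks match the heuristic.
    static List<String> suspectThreadNames(Map<Thread, StackTraceElement[]> dump) {
        List<String> names = new ArrayList<String>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : dump.entrySet()) {
            if (looksStuckInRollback(entry.getValue())) {
                names.add(entry.getKey().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Thread.getAllStackTraces() only sees threads of the current JVM.
        for (String name : suspectThreadNames(Thread.getAllStackTraces())) {
            System.out.println(name);
        }
    }
}
```

Taking several dumps a few seconds apart and checking whether the same thread stays in the same frames gives a better indication that it is genuinely blocked.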

