RHEL6.4: IBM AMQ multiple application nodes fail over at the same time unexpectedly due to xcsReadFileLock taking longer than 10 seconds

Solution Unverified

Issue

We have a cluster of nodes running an AMQ application. The application uses a 'master' file on an NFS mount and checks whether that file is accessible. If the file is not present, or more precisely if no data from it can be read into the stream, the node assumes it is dead. Meanwhile, the standby nodes attempt to obtain a file lock on this same file. If a standby node obtains the lock, it assumes the primary node is dead and comes online.
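The two checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not IBM's implementation: the function names master_file_readable and try_acquire_lock are hypothetical, and the use of a non-blocking POSIX lock through fcntl.lockf is an assumption about how a standby node might test the lock.

```python
import fcntl
import os


def master_file_readable(path, nbytes=1):
    """Active-node check (hypothetical): True if data can be read from the file."""
    try:
        with open(path, "rb") as f:
            return len(f.read(nbytes)) > 0
    except OSError:
        # Missing file, stale NFS handle, I/O error: treat as not readable.
        return False


def try_acquire_lock(fd):
    """Standby-node check (hypothetical): try a non-blocking exclusive lock.

    fd must be open for writing. Returns True if the lock was obtained,
    which a standby would interpret as "the primary no longer holds it".
    """
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        # Lock is held by another process, or locking failed.
        return False
```

Note that in this sketch a read that merely blocks for a long time is not detected as a failure; a real implementation would need its own timeout around the read, which is where a slow NFS response becomes significant.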

We are running into a situation in which the AMQ master nodes fail over to their secondaries all at once; that is, multiple nodes fail over at the same time.

IBM provided an AMQ trace showing that the function xcsReadFileLock is stuck in what appears to be a read() call (xcsReadFileLock is a wrapper for read()):

 22:30:02.831896    26925.8           :       --{  xcsReadFileLock
 22:30:02.832148    26925.8           :       --}  xcsReadFileLock rc=OK
 22:30:12.832278    26925.8           :       --{  xcsReadFileLock
 22:30:12.832457    26925.8           :       --}  xcsReadFileLock rc=OK
 22:30:22.832614    26925.8           :       --{  xcsReadFileLock
 22:30:22.832797    26925.8           :       --}  xcsReadFileLock rc=OK
...
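To find the slow calls in a longer trace, the entry and exit lines can be paired and timed. The following Python sketch assumes the line layout seen in the excerpt above (timestamp, an identifier column, then "{" for entry and "}" for exit); the helper name call_durations is hypothetical, and treating the second column as a per-thread identifier is an assumption.

```python
import re

# Matches lines such as:
#  22:30:02.831896    26925.8           :       --{  xcsReadFileLock
TRACE_LINE = re.compile(
    r"^\s*(\d{2}):(\d{2}):(\d{2}\.\d+)\s+(\S+)\s+:\s+-*([{}])\s+(\w+)"
)


def call_durations(lines, function="xcsReadFileLock"):
    """Yield (entry_timestamp, seconds) for each completed call of `function`."""
    open_calls = {}  # identifier column -> (timestamp text, seconds since midnight)
    for line in lines:
        m = TRACE_LINE.match(line)
        if not m or m.group(6) != function:
            continue  # skip unrelated or elided lines
        hh, mm, ss, ident, brace = m.group(1, 2, 3, 4, 5)
        t = int(hh) * 3600 + int(mm) * 60 + float(ss)
        if brace == "{":
            open_calls[ident] = (line.split()[0], t)
        elif ident in open_calls:
            stamp, start = open_calls.pop(ident)
            # Modulo handles a call that spans midnight.
            yield stamp, (t - start) % 86400
```

For example, `[c for c in call_durations(open("amq.trc")) if c[1] > 10.0]` would list the calls that exceeded the 10 second limit named in the title, where amq.trc is a placeholder file name.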

Environment

  • Red Hat Enterprise Linux 6 (NFS client)
    • kernel 2.6.32-358.20.1.el6
  • NetApp filer (NFS server)
  • NFS4
  • IBM Active MQ
