RHEL6.4: IBM AMQ multiple application nodes fail over at the same time unexpectedly due to xcsReadFileLock taking longer than 10 seconds
Issue
We have a cluster of nodes running an AMQ application. The application checks whether a 'master' file on an NFS mount is accessible. If this file is not present, or rather, if no data from it can be read into the stream, the node assumes it is dead. Meanwhile the other nodes, which are on standby, attempt to get a file lock on this same file; if one of them manages to obtain the lock, it assumes the primary node is dead and comes online.
We're running into a situation in which the AMQ master nodes all fail over to their secondaries; that is, multiple nodes fail over at the same time.
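The health check and takeover logic described above can be sketched roughly as follows. This is not IBM's implementation: the function names master_file_readable and try_take_over are hypothetical, the 10-second deadline is taken from the title of this article, and the use of POSIX fcntl locks is an assumption. The sketch only times the read after the fact; it cannot interrupt a read() that is blocked on a hard NFS mount.

```python
import fcntl
import os
import time

READ_DEADLINE_S = 10.0  # assumed threshold, taken from the article title


def master_file_readable(path, deadline_s=READ_DEADLINE_S):
    """Active-node check (hypothetical): try to read one byte of the master file.

    Returns (ok, elapsed_seconds). ok is False if the file is missing,
    yields no data, or the read took longer than deadline_s.
    """
    start = time.monotonic()
    try:
        with open(path, "rb") as f:
            data = f.read(1)
    except OSError:
        data = b""
    elapsed = time.monotonic() - start
    return (len(data) > 0 and elapsed <= deadline_s), elapsed


def try_take_over(path):
    """Standby-node check (hypothetical): non-blocking exclusive lock attempt.

    Returns the open file descriptor if the lock was obtained (the caller
    keeps it open to hold the lock), or None if another process holds it.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd
```

With logic of this shape, a slow read on the active node and a successful lock attempt on a standby node both lead to the same outcome, a failover, which is why a stall on the shared NFS file can affect several nodes at once.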
From IBM, we've obtained an AMQ trace that shows the function xcsReadFileLock stuck in what appears to be a read() call (xcsReadFileLock is a wrapper for read()):
22:30:02.831896 26925.8 : --{ xcsReadFileLock
22:30:02.832148 26925.8 : --} xcsReadFileLock rc=OK
22:30:12.832278 26925.8 : --{ xcsReadFileLock
22:30:12.832457 26925.8 : --} xcsReadFileLock rc=OK
22:30:22.832614 26925.8 : --{ xcsReadFileLock
22:30:22.832797 26925.8 : --} xcsReadFileLock rc=OK
...
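To find stalled calls in a longer trace, the entry (--{) and exit (--}) lines can be paired per thread and the elapsed time compared against the 10-second limit. The following is a minimal sketch that assumes every trace line has the same layout as the excerpt above (timestamp, process.thread ID, marker, function name); the helper names call_durations and slow_calls are made up for illustration, and traces that cross midnight are not handled.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout: "HH:MM:SS.ffffff PID.TID : --{ functionName ..."
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<tid>\S+)\s+:\s+-+"
    r"(?P<kind>[{}])\s+(?P<func>\w+)"
)


def call_durations(lines, func="xcsReadFileLock"):
    """Pair entry/exit lines per thread; return [(entry_time, duration)]."""
    open_calls = {}  # thread id -> timestamp of the pending entry line
    results = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("func") != func:
            continue
        ts = datetime.strptime(m.group("ts"), "%H:%M:%S.%f")
        tid = m.group("tid")
        if m.group("kind") == "{":
            open_calls[tid] = ts
        elif tid in open_calls:
            start = open_calls.pop(tid)
            results.append((start.time(), ts - start))
    return results


def slow_calls(lines, threshold=timedelta(seconds=10)):
    """Return only the calls whose duration exceeds the threshold."""
    return [(t, d) for t, d in call_durations(lines) if d > threshold]
```

For the three calls visible in the excerpt, each returns in well under a millisecond, so slow_calls yields an empty list for those lines; any stalled call would have to appear in the part of the trace not reproduced here.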
Environment
- Red Hat Enterprise Linux 6 (NFS client)
- kernel 2.6.32-358.20.1.el6
- NetApp filer (NFS server)
- NFS4
- IBM Active MQ