RHEL 5 Cluster node was evicted by qdiskd but did not log any messages before eviction indicating that its I/O was hung or failing
Issue
- My node was evicted by qdiskd on another node, but the evicted node gave no indication that its I/O was failing or hanging before the eviction. We know the node was still responsive (i.e., it had not panicked or completely hung) because cman/openais reported being killed by the eviction:
Jul 30 02:24:54 node2 openais[22959]: [CMAN ] cman killed by node 1 because we were killed by cman_tool or other application
- qdiskd is supposed to warn us when it detects hung or failed I/O, but neither warning shows up in the logs here.
Environment
- Red Hat Enterprise Linux (RHEL) 5 Update 4 through RHEL 5 Update 8 with the High Availability Add-On
- cman releases starting with 2.0.115-1.el5 up to (but not including) 2.0.115-109.el5
  - Releases earlier than cman-2.0.115-1.el5 did not report any warning for hung I/O in qdiskd, so on those releases an eviction caused by stalled I/O is expected to occur without any logged indication of the cause
  - Releases from cman-2.0.115-109.el5 onward have a feature to avoid evictions when I/O is hanging (https://access.redhat.com/knowledge/solutions/153223)
- Cluster configured to use a quorum device (<quorumd> in /etc/cluster/cluster.conf)
- Node evicted but qdiskd on the evicted node does not report any of:
qdiskd[XXX]: <warning> qdiskd: read (system call) has hung for YY seconds
qdiskd[XXX]: <warning> qdiskd: write (system call) has hung for YY seconds
qdiskd[XXX]: <error> Error writing to quorum disk
- Evidence suggesting that the node was still alive and responsive up until the point it was evicted. For example, it logs a message indicating that it recognized it was evicted:
openais[22959]: [CMAN ] cman killed by node 1 because we were killed by cman_tool or other application
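The conditions above can be checked roughly with a short script. The following is a minimal sketch, not a supported tool: the function names release_in_affected_range and matches_symptoms are hypothetical, the release check only understands version strings of the form 2.0.115-N.el5 named in this article (it is not a full RPM version comparison), and the log check assumes the caller passes only the lines from the period leading up to the eviction.

```python
import re

# Message texts are taken from this article. The PID in brackets and the
# hang duration vary, so they are matched loosely.
QDISKD_IO_PATTERNS = [
    re.compile(r"qdiskd\[\d+\]: <warning> qdiskd: (read|write) "
               r"\(system call\) has hung for \d+ seconds"),
    re.compile(r"qdiskd\[\d+\]: <error> Error writing to quorum disk"),
]
EVICTION_PATTERN = re.compile(
    r"openais\[\d+\]: \[CMAN\s*\] cman killed by node \d+")

# Assumption: only the 2.0.115-N.el5 family is handled; N is the release number.
CMAN_RELEASE = re.compile(r"^(?:cman-)?2\.0\.115-(\d+)\.el5")


def release_in_affected_range(version_release):
    """True if the release is >= 2.0.115-1.el5 and < 2.0.115-109.el5."""
    match = CMAN_RELEASE.match(version_release)
    if match is None:
        return False  # outside the version family this sketch understands
    return 1 <= int(match.group(1)) < 109


def matches_symptoms(log_lines):
    """True if an eviction is logged with no qdiskd I/O message before it."""
    seen_io_message = False
    for line in log_lines:
        if any(p.search(line) for p in QDISKD_IO_PATTERNS):
            seen_io_message = True
        elif EVICTION_PATTERN.search(line):
            return not seen_io_message
    return False  # no eviction message found at all


if __name__ == "__main__":
    sample = [
        "Jul 30 02:24:54 node2 openais[22959]: [CMAN ] cman killed by node 1 "
        "because we were killed by cman_tool or other application",
    ]
    print(release_in_affected_range("cman-2.0.115-1.el5"))
    print(matches_symptoms(sample))
```

In practice the lines would come from the evicted node's system log, and the version string from the installed cman package. Whether <quorumd> is present in /etc/cluster/cluster.conf still needs to be confirmed separately.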