Node with an unresponsive quorum device can race to fence the other node in a RHEL cluster
Issue
- When a node stops updating the quorum disk and sending totem token traffic at the same time (for example, when the quorum device is on an iSCSI LUN and the network goes down), the failing node may fence the still-healthy node; see the configuration sketch after the log excerpt below.
- Fencing happens before the cluster node checks whether it is quorate:
    Sep 16 23:24:49 hostname1 qdiskd: read (system call) has hung for 2 seconds
    Sep 16 23:24:49 In 2 more seconds, we will be evicted
    Sep 16 23:24:56 corosync [TOTEM ] A processor failed, forming new configuration.
    Sep 16 23:24:56 corosync [TOTEM ] The network interface is down.
    Sep 16 23:24:58 corosync [QUORUM] Members[1]: 1
    Sep 16 23:24:58 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Sep 16 23:24:58 corosync [CPG ] downlist received left_list: 1
    Sep 16 23:24:58 corosync [CPG ] chosen downlist from node r(0) ip(127.0.0.1)
    Sep 16 23:24:58 corosync [MAIN ] Completed service synchronization, ready to provide service.
    Sep 16 23:24:58 fenced fencing node node2
    Sep 16 23:25:00 corosync [CMAN ] lost contact with quorum device
    Sep 16 23:25:00 corosync [CMAN ] quorum lost, blocking activity
    Sep 16 23:25:00 corosync [QUORUM] This node is within the non-primary component and will NOT provide any services.
    Sep 16 23:25:00 corosync [QUORUM] Members[1]: 1
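In the timeline above, the failing node's totem token timeout expires and fenced starts fencing (23:24:58) before CMAN registers the lost quorum device (23:25:00). One common mitigation is to keep the totem token timeout comfortably above the qdiskd eviction timeout (interval x tko seconds), so that a node that loses the quorum device declares itself inquorate before it can initiate a fence. The following cluster.conf fragment is a hedged sketch of that relationship; the cluster name, node names, and timeout values are illustrative assumptions, not tuned recommendations:

    <cluster name="mycluster" config_version="3">
        <!-- qdiskd evicts a node after interval * tko = 1 * 10 = 10 seconds -->
        <quorumd label="qdisk" interval="1" tko="10" votes="1"/>
        <!-- Totem token timeout in milliseconds. Keeping it well above the
             10-second qdiskd timeout gives a node that loses the quorum
             device time to notice it is inquorate before fencing begins. -->
        <totem token="21000"/>
        <clusternodes>
            <clusternode name="node1" nodeid="1"/>
            <clusternode name="node2" nodeid="2"/>
        </clusternodes>
        <!-- two nodes at 1 vote each plus 1 vote from the quorum disk -->
        <cman expected_votes="3"/>
    </cluster>

Depending on the release, the quorumd master_wins option may also help in two-node clusters by letting only the qdiskd master win a fence race; consult the qdisk(5) man page for the exact timing guidance and option behavior in your release.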
Environment
- Red Hat Enterprise Linux (RHEL) 5 Advanced Platform (Clustering)
- Red Hat Enterprise Linux (RHEL) Server 6 (with the High Availability Add-On)
