A node with an unresponsive quorum device can race to fence the other node in a RHEL cluster
Issue
- When the quorum disk and token updates from a node stop at the same time (for example, the quorum device is on an iSCSI LUN and the network goes down), the failing node may fence the still-healthy node.
- Fencing occurs before the cluster node checks whether it is quorate:
    Sep 16 23:24:49 hostname1 qdiskd: read (system call) has hung for 2 seconds
    Sep 16 23:24:49 hostname1 In 2 more seconds, we will be evicted
    Sep 16 23:24:56 corosync [TOTEM ] A processor failed, forming new configuration.
    Sep 16 23:24:56 corosync [TOTEM ] The network interface is down.
    Sep 16 23:24:58 corosync [QUORUM] Members[1]: 1
    Sep 16 23:24:58 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Sep 16 23:24:58 corosync [CPG   ] downlist received left_list: 1
    Sep 16 23:24:58 corosync [CPG   ] chosen downlist from node r(0) ip(127.0.0.1)
    Sep 16 23:24:58 corosync [MAIN  ] Completed service synchronization, ready to provide service.
    Sep 16 23:24:58 fenced fencing node node2
    Sep 16 23:25:00 corosync [CMAN  ] lost contact with quorum device
    Sep 16 23:25:00 corosync [CMAN  ] quorum lost, blocking activity
    Sep 16 23:25:00 corosync [QUORUM] This node is within the non-primary component and will NOT provide any services.
    Sep 16 23:25:00 corosync [QUORUM] Members[1]: 1
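The race happens because qdiskd's eviction window (interval × tko seconds) can expire before corosync's totem token timeout, so the node initiates fencing before it has discovered that it lost quorum itself. The qdisk(5) man page advises sizing CMAN's token timeout to at least twice the quorum daemon's timeout. A minimal cluster.conf sketch of the relevant parameters (the cluster name, label, and timeout values below are illustrative assumptions, not a tuned recommendation for any specific environment):

```xml
<!-- Illustrative fragment only; tune values for your environment. -->
<cluster name="example" config_version="1">
  <!-- qdiskd declares a node dead after interval * tko = 1 * 10 = 10 seconds -->
  <quorumd interval="1" tko="10" votes="1" label="example_qdisk"/>
  <!-- totem token timeout in milliseconds; sized at more than twice the
       qdiskd timeout (21s > 2 * 10s) so membership loss is detected and
       quorum is re-evaluated before qdiskd-driven eviction can race it -->
  <cman ... />
  <totem token="21000"/>
  ...
</cluster>
```

With the timeouts related this way, a node that loses both the network and the quorum device sees quorum loss before it reaches the point of fencing its healthy peer.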
Environment
- Red Hat Enterprise Linux (RHEL) 5 Advanced Platform (Clustering)
- Red Hat Enterprise Linux (RHEL) Server 6 (with the High Availability Add-On)