A node with an unresponsive quorum device can race to fence the other node in a RHEL cluster


Issue

  • When a node stops updating both the quorum disk and the corosync token at the same time (for example, when the quorum device is on an iSCSI LUN and the node's network goes down), the failing node may fence the still-healthy node. 

  • Fencing is initiated before the cluster node checks whether it is quorate, as shown in the log excerpt below (a configuration timing sketch follows the excerpt):

    Sep 16 23:24:49 hostname1 qdiskd: read (system call) has hung for 2 seconds
    Sep 16 23:24:49 hostname1 In 2 more seconds, we will be evicted
    Sep 16 23:24:56 corosync [TOTEM ] A processor failed, forming new configuration.
    Sep 16 23:24:56 corosync [TOTEM ] The network interface is down.
    Sep 16 23:24:58 corosync [QUORUM] Members[1]: 1
    Sep 16 23:24:58 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Sep 16 23:24:58 corosync [CPG   ] downlist received left_list: 1
    Sep 16 23:24:58 corosync [CPG   ] chosen downlist from node r(0) ip(127.0.0.1)
    Sep 16 23:24:58 corosync [MAIN  ] Completed service synchronization, ready to provide service.
    Sep 16 23:24:58 fenced fencing node node2
    Sep 16 23:25:00 corosync [CMAN  ] lost contact with quorum device
    Sep 16 23:25:00 corosync [CMAN  ] quorum lost, blocking activity
    Sep 16 23:25:00 corosync [QUORUM] This node is within the non-primary component and will NOT provide any services.
    Sep 16 23:25:00 corosync [QUORUM] Members[1]: 1
    

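The race window is bounded by two independent timeouts in /etc/cluster/cluster.conf: qdiskd evicts a node after roughly interval × tko seconds, while corosync declares a processor failed once the totem token timeout (in milliseconds) expires. The excerpt below is only an illustrative sketch of where those parameters live; the element names (quorumd, totem, cman) are standard cluster.conf syntax, but the values shown are assumptions and are not presented as the resolution for this issue.

    <!-- Illustrative cluster.conf excerpt; the values are assumptions, not a recommendation -->
    <cluster name="example" config_version="1">
      <!-- qdiskd eviction window is roughly interval * tko seconds (here 1 * 10 = 10 s) -->
      <quorumd label="qdisk" interval="1" tko="10" votes="1"/>
      <!-- corosync logs "A processor failed, forming new configuration." once the token
           timeout (in milliseconds) expires; here it is kept longer than the qdiskd window -->
      <totem token="21000"/>
      <cman expected_votes="3"/>
      <clusternodes>
        <!-- node definitions omitted -->
      </clusternodes>
    </cluster>
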
Environment

  • Red Hat Enterprise Linux (RHEL) 5 Advanced Platform (Clustering)
  • Red Hat Enterprise Linux (RHEL) Server 6 (with the High Availability Add-On)
