RHEL 5 or 6 cluster node "lost contact with quorum device"

Solution Verified - Updated 2018-08-08T22:48:05+00:00

Issue

A cluster node loses quorum after rebooting, removing, or fencing another node from the cluster
Quorum disk lost connectivity and caused a node reboot. What should the cman setting quorum_dev_poll be set to?
My cluster lost contact with the quorum device and the cluster nodes rebooted:

openais[11663]: [CMAN ] lost contact with quorum device
openais[11664]: [CMAN ] cman killed by node 1 because we were killed by cman_tool or other application

After one path in a multipath map for the quorum device is disconnected, the node logs "lost contact with quorum device" and loses quorum. The cluster services then stop instead of continuing to run or relocating to another node, which is the expected behavior.

Aug  1 03:32:23 node1 kernel: qla2xxx 0000:04:00.0: LOOP DOWN detected (4 3 0 0).
Aug  1 03:32:40 node1 openais[12015]: [logging.c:0042] lost contact with quorum device
Aug  1 03:32:40 node1 openais[12015]: [logging.c:0042] quorum lost, blocking activity
Aug  1 03:32:40 node1 clurgmgrd[12095]: <emerg> #1: Quorum Dissolved
Aug  1 03:32:40 node11 clurgmgrd[12095]: <debug> Emergency stop of service:myService

After a node is fenced, the standby node attempts to recover the service, but the operation fails after openais reports "lost contact with quorum device" and quorum is lost:

Feb 19 20:56:25 node2 fenced[8126]: fence "node1" success      
Feb 19 20:56:27 node2 clurgmgrd[12213]: <notice> Taking over service service:myService from down member node1
Feb 19 20:56:29 node2 qdiskd[8108]: <info> Assuming master role
Feb 19 20:56:30 node2 openais[8075]: [CMAN ] lost contact with quorum device    
Feb 19 20:56:30 node2 openais[8075]: [CMAN ] quorum lost, blocking activity
Feb 19 20:56:30 node2 clurgmgrd[12213]: <emerg> #1: Quorum Dissolved       
Feb 19 20:56:30 node2 ccsd[8069]: Cluster is not quorate.  Refusing connection.
Feb 19 20:56:30 node2 ccsd[8069]: Error while processing connect: Connection refused
Feb 19 20:56:30 node2 ccsd[8069]: Invalid descriptor specified (-111).
Feb 19 20:56:30 node2 ccsd[8069]: Someone may be attempting something evil.
Feb 19 20:56:30 node2 ccsd[8069]: Error while processing get: Invalid request descriptor
Feb 19 20:56:30 node2 ccsd[8069]: Invalid descriptor specified (-21).
Feb 19 20:56:30 node2 ccsd[8069]: Someone may be attempting something evil.
Feb 19 20:56:30 node2 ccsd[8069]: Error while processing disconnect: Invalid request descriptor
Feb 19 20:56:32 node2 qdiskd[8108]: <notice> Writing eviction notice for node 1
Feb 19 20:56:32 node2 openais[8075]: [CMAN ] quorum regained, resuming activity
Feb 19 20:56:33 node2 clurgmgrd[12213]: <err> #75: Failed changing service status

A server rebooted after losing contact with the quorum device in a two-node High Availability cluster. What could cause a node to lose contact with qdiskd and restart?
We're testing network failures in a two-node cluster to see whether resources fail over. If I disrupt the network on node 1, node 2 fences node 1, but node 2 then reports "Lost contact with quorum device" and stops its high availability resources.

Mar  9 18:59:03 node2 corosync[3006]:   [TOTEM ] A processor failed, forming new configuration.
Mar  9 18:59:05 node2 corosync[3006]:   [QUORUM] Members[1]: 2
Mar  9 18:59:05 node2 corosync[3006]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar  9 18:59:05 node2 kernel: dlm: closing connection to node 1
Mar  9 18:59:05 node2 rgmanager[6308]: State change: node1.example.com DOWN
Mar  9 18:59:05 node2 fenced[3304]: fencing node node1.example.com
Mar  9 18:59:05 node2 kernel: GFS2: fsid=mycluster:clusterFS02.1: jid=0: Trying to acquire journal lock...
Mar  9 18:59:05 node2 kernel: GFS2: fsid=mycluster:clusterFS03.1: jid=0: Trying to acquire journal lock...
Mar  9 18:59:33 node2 fenced[3304]: fence node1.example.com success
Mar  9 18:59:36 node2 corosync[3006]: [CMAN  ] lost contact with quorum device
Mar  9 18:59:36 node2 corosync[3006]: [CMAN  ] quorum lost, blocking activity
Mar  9 18:59:36 node2 corosync[3006]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar  9 18:59:36 node2 corosync[3006]: [QUORUM] Members[1]: 2
Mar  9 18:59:36 node2 rgmanager #1: Quorum Dissolved
Mar  9 18:59:39 node2 rgmanager [ip] Removing IPv4 address 10.x.x.x/24 from eth0
Mar  9 18:59:46 node2 qdiskd[3062]: Assuming master role
Mar  9 18:59:47 node2 qdiskd[3062]: Writing eviction notice for node 1
Mar  9 18:59:47 node2 corosync[3006]:   [CMAN  ] quorum regained, resuming activity
Mar  9 18:59:47 node2 corosync[3006]:   [QUORUM] This node is within the primary component and will provide service.
Mar  9 18:59:47 node2 corosync[3006]:   [QUORUM] Members[1]: 2
Mar  9 18:59:47 node2 qdiskd[3062]: Node 1 evicted

One node fences the other, then loses quorum after "Lost contact with quorum device", causing it to stop services
A node shows "Lost contact with quorum device" after fencing another, and GFS2 blocks
qdiskd does not contribute votes on the non-master node for a short time after it fences the master node
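
The log excerpts above share one pattern: a "lost contact with quorum device" message, followed some seconds later by "quorum regained, resuming activity" on the same host. As a rough aid for spotting that pattern in a larger log, the sketch below pairs the two messages per host and reports the gap between them. The function name quorum_gaps and the year argument are hypothetical, introduced only for this illustration. It assumes classic syslog timestamps with English month names, which carry no year.

```python
import re
from datetime import datetime

LOST = "lost contact with quorum device"
REGAINED = "quorum regained, resuming activity"

# Classic syslog prefix, e.g. "Mar  9 18:59:36 node2 ..."
_PREFIX = re.compile(r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s")


def quorum_gaps(lines, year=2000):
    """Return (host, lost_at, regained_at, seconds) for each loss/regain pair.

    `year` is an assumption: syslog timestamps in this format omit the
    year, so the caller has to supply one.
    """
    pending = {}  # host -> time of the not-yet-matched "lost contact" line
    results = []
    for line in lines:
        match = _PREFIX.match(line)
        if not match:
            continue
        stamp = datetime.strptime(
            f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
        )
        host = match.group(2)
        lowered = line.lower()
        if LOST in lowered:
            # Keep the earliest unmatched loss for this host.
            pending.setdefault(host, stamp)
        elif REGAINED in lowered and host in pending:
            lost_at = pending.pop(host)
            seconds = (stamp - lost_at).total_seconds()
            results.append((host, lost_at, stamp, seconds))
    return results
```

Feeding it the contents of /var/log/messages (for example, quorum_gaps(open("/var/log/messages"))) lists each window during which a node ran without the quorum device's votes. A loss with no matching "regained" line is left unreported by this sketch.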

Environment

Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add-On
Cluster utilizes a quorum device ("QDisk") - /etc/cluster/cluster.conf contains a <quorumd/> section
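
To check whether a given cluster matches this environment, the configuration file can be inspected for a quorumd element. The following is a minimal sketch, not a supported tool. The function name qdisk_settings is hypothetical, and the sample configuration with its attribute values is invented purely for illustration and is not a recommendation. It assumes quorumd and cman appear as direct children of the top-level cluster element, and that quorum_dev_poll, when set, is an attribute of cman.

```python
import xml.etree.ElementTree as ET


def qdisk_settings(conf_text):
    """Summarize quorum-device settings found in cluster.conf text."""
    root = ET.fromstring(conf_text)
    quorumd = root.find("quorumd")
    cman = root.find("cman")
    return {
        # Compare against None: an Element with no children is falsy.
        "has_quorumd": quorumd is not None,
        "quorumd": dict(quorumd.attrib) if quorumd is not None else {},
        "quorum_dev_poll": (
            cman.get("quorum_dev_poll") if cman is not None else None
        ),
    }


# Hypothetical configuration; every value here is illustrative only.
EXAMPLE_CONF = """
<cluster name="mycluster" config_version="1">
  <cman quorum_dev_poll="20000"/>
  <quorumd label="myqdisk" interval="1" tko="10" votes="1"/>
</cluster>
"""

if __name__ == "__main__":
    print(qdisk_settings(EXAMPLE_CONF))
```

On a real node, the text would come from /etc/cluster/cluster.conf. If has_quorumd is False, the cluster does not use a quorum device and this article does not apply.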
