LVM commands and GFS2 file systems are blocked following a membership transition and fencing of a node
Issue
- When the node running our resource that uses GFS2 is powered off, that resource fails to start on the new node in a timely manner.
- GFS2 is blocked after a node gets fenced.
- When stonith fails once but then succeeds on the retry, GFS2 file systems and `clvmd` seem to remain blocked on the remaining node in the cluster until the fenced node rejoins.
- After fencing, we see messages from `dlm_controld` indicating errors, and `dlm_tool ls` says `wait fencing` for some lockspaces:
Aug 24 18:21:22 node1 dlm_controld[348]: 1845218 fence wait 2 pid 16002 running
Aug 24 18:21:22 node1 dlm_controld[348]: 1845218 clvmd wait for fencing
[...]
Aug 24 18:21:24 node1 stonith-api[16002]: stonith_api_kick: Could not kick (reboot) node 2/(null) : Timer expired (-62)
Aug 24 18:21:24 node1 dlm_stonith[16002]: kick_helper error -62 nodeid 2
Aug 24 18:21:24 node1 crmd[540]: notice: Stonith operation 6/89:6803:0:92a6022f-ae0d-47c3-bb72-9bfb9dd6bf51: Timer expired (-62)
Aug 24 18:21:24 node1 crmd[540]: notice: Stonith operation 6 for node2 failed (Timer expired): aborting transition.
[...]
Aug 24 18:21:25 node1 dlm_controld[348]: 1845221 fence result 2 pid 16002 result 194 exit status
Aug 24 18:21:25 node1 dlm_controld[348]: 1845221 fence status 2 receive 194 from 1 walltime 1472077285 local 1845221
Aug 24 18:21:25 node1 dlm_controld[348]: 1845221 fence request 2 no actor
[...]
Aug 24 18:22:34 node1 stonith-ng[531]: notice: Operation 'reboot' [17416] (call 8 from crmd.540) for host 'node2' with device 'xvmfence' returned: 0 (OK)
Aug 24 18:22:34 node1 stonith-ng[531]: notice: Operation reboot of node2 by node1 for crmd.540@node1.81d4337b: OK
Aug 24 18:22:34 node1 crmd[540]: notice: Stonith operation 8/89:6805:0:92a6022f-ae0d-47c3-bb72-9bfb9dd6bf51: OK (0)
Aug 24 18:22:34 node1 crmd[540]: notice: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=81d4337b-b643-4305-9963-7b012a48d35a) by client crmd.540
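
To check whether a given log shows the same sequence (a lockspace waiting for fencing, a non-zero fence result seen by `dlm_controld`, then a later successful stonith reboot), a small script can scan for the three message types. The following is a minimal sketch, not a supported tool: the regular expressions are modelled only on the excerpts above and are an assumption about message wording on other releases, and the function name `summarize_fencing` is hypothetical. It does not try to map a DLM node ID (such as `2`) to a host name (such as `node2`).

```python
import re

# Patterns derived only from the log excerpts in this article (assumption:
# other releases may word these messages differently).
WAIT_RE = re.compile(r"dlm_controld\[\d+\]: \d+ (\S+) wait for fencing")
DLM_RESULT_RE = re.compile(
    r"dlm_controld\[\d+\]: \d+ fence result (\d+) pid \d+ result (\d+)"
)
STONITH_OK_RE = re.compile(
    r"stonith-ng\[\d+\]: notice: Operation 'reboot' .* for host '([^']+)' "
    r".* returned: 0 \(OK\)"
)


def summarize_fencing(lines):
    """Collect waiting lockspaces, failed DLM fence results, and stonith successes."""
    waiting = []          # lockspaces reported as waiting for fencing
    failed_nodeids = []   # node IDs whose DLM fence helper returned non-zero
    ok_hosts = []         # hosts that stonith-ng reports as rebooted OK
    failed_then_ok = False
    for line in lines:
        m = WAIT_RE.search(line)
        if m and m.group(1) not in waiting:
            waiting.append(m.group(1))
        m = DLM_RESULT_RE.search(line)
        if m and int(m.group(2)) != 0:
            failed_nodeids.append(int(m.group(1)))
        m = STONITH_OK_RE.search(line)
        if m:
            ok_hosts.append(m.group(1))
            # A success that follows an earlier DLM-side failure is the
            # fail-then-retry pattern described in this article.
            if failed_nodeids:
                failed_then_ok = True
    return {
        "waiting_lockspaces": waiting,
        "failed_dlm_fence_nodeids": failed_nodeids,
        "stonith_ok_hosts": ok_hosts,
        "failed_then_succeeded": failed_then_ok,
    }


# Lines copied from the excerpt above.
SAMPLE = [
    "Aug 24 18:21:22 node1 dlm_controld[348]: 1845218 fence wait 2 pid 16002 running",
    "Aug 24 18:21:22 node1 dlm_controld[348]: 1845218 clvmd wait for fencing",
    "Aug 24 18:21:25 node1 dlm_controld[348]: 1845221 fence result 2 pid 16002 result 194 exit status",
    "Aug 24 18:22:34 node1 stonith-ng[531]: notice: Operation 'reboot' [17416] (call 8 from crmd.540) for host 'node2' with device 'xvmfence' returned: 0 (OK)",
]

if __name__ == "__main__":
    print(summarize_fencing(SAMPLE))
```

If `failed_then_succeeded` is true while `dlm_tool ls` still reports `wait fencing`, the log is consistent with the issue described here; it is an indicator only, not a confirmation.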
Environment
- Red Hat Enterprise Linux with the Resilient Storage Add On
- One or more applications or services using DLM. Qualifying situations include:
  - A `controld` resource is managed by the cluster
  - One or more GFS2 file systems are mounted in the cluster, possibly through a `Filesystem` resource managed by the cluster
- A `stonith` layout that may result in a fencing operation that fails one or more times before succeeding
  - This often occurs as a result of fencing operation timeouts, so environments with slow-to-respond stonith devices may be more at risk