LVM commands and GFS2 file systems are blocked following a membership transition and fencing of a node in a RHEL 7 Resilient Storage cluster
Issue
- When the node running our GFS2-backed resource is powered off, that resource fails to start on the new node in a timely manner.
- GFS2 is blocked after a node gets fenced
- When stonith fails once but then succeeds on the retry, GFS2 file systems and clvmd seem to remain blocked on the remaining node in the cluster until the fenced node rejoins
- After fencing, we see messages from dlm_controld indicating errors, and dlm_tool ls says "wait fencing" for some lockspaces
Aug 24 18:21:22 node1 dlm_controld[348]: 1845218 fence wait 2 pid 16002 running
Aug 24 18:21:22 node1 dlm_controld[348]: 1845218 clvmd wait for fencing
[...]
Aug 24 18:21:24 node1 stonith-api[16002]: stonith_api_kick: Could not kick (reboot) node 2/(null) : Timer expired (-62)
Aug 24 18:21:24 node1 dlm_stonith[16002]: kick_helper error -62 nodeid 2
Aug 24 18:21:24 node1 crmd[540]: notice: Stonith operation 6/89:6803:0:92a6022f-ae0d-47c3-bb72-9bfb9dd6bf51: Timer expired (-62)
Aug 24 18:21:24 node1 crmd[540]: notice: Stonith operation 6 for node2 failed (Timer expired): aborting transition.
[...]
Aug 24 18:21:25 node1 dlm_controld[348]: 1845221 fence result 2 pid 16002 result 194 exit status
Aug 24 18:21:25 node1 dlm_controld[348]: 1845221 fence status 2 receive 194 from 1 walltime 1472077285 local 1845221
Aug 24 18:21:25 node1 dlm_controld[348]: 1845221 fence request 2 no actor
[...]
Aug 24 18:22:34 node1 stonith-ng[531]: notice: Operation 'reboot' [17416] (call 8 from crmd.540) for host 'node2' with device 'xvmfence' returned: 0 (OK)
Aug 24 18:22:34 node1 stonith-ng[531]: notice: Operation reboot of node2 by node1 for crmd.540@node1.81d4337b: OK
Aug 24 18:22:34 node1 crmd[540]: notice: Stonith operation 8/89:6805:0:92a6022f-ae0d-47c3-bb72-9bfb9dd6bf51: OK (0)
Aug 24 18:22:34 node1 crmd[540]: notice: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=81d4337b-b643-4305-9963-7b012a48d35a) by client crmd.540
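The pattern in the excerpt above (a lockspace waiting for fencing, a failed stonith attempt, then a successful retry) can be picked out of a longer log with a small script. The following is only a sketch: the regular expressions and the summarize function are illustrative assumptions modeled on the message formats shown in this excerpt, and other releases may word these messages differently.

```python
import re

# Assumption: these patterns mirror the messages in the excerpt above only.
WAIT_RE = re.compile(r"dlm_controld\[\d+\]: \d+ (\S+) wait for fencing")
FAIL_RE = re.compile(
    r"crmd\[\d+\]:\s+notice: Stonith operation \d+ for (\S+) failed \(([^)]*)\)"
)
OK_RE = re.compile(
    r"stonith-ng\[\d+\]:\s+notice: Operation '(\w+)' \[\d+\].* for host '([^']+)'"
    r".* returned: 0 \(OK\)"
)


def summarize(lines):
    """Collect lockspaces waiting on fencing and stonith results from log lines."""
    summary = {"waiting_lockspaces": set(), "failed": [], "succeeded": []}
    for line in lines:
        match = WAIT_RE.search(line)
        if match:
            summary["waiting_lockspaces"].add(match.group(1))
            continue
        match = FAIL_RE.search(line)
        if match:
            # (target node, failure reason)
            summary["failed"].append((match.group(1), match.group(2)))
            continue
        match = OK_RE.search(line)
        if match:
            # (target node, fence action)
            summary["succeeded"].append((match.group(2), match.group(1)))
    return summary


# Sample lines copied from the excerpt above.
SAMPLE = [
    "Aug 24 18:21:22 node1 dlm_controld[348]: 1845218 fence wait 2 pid 16002 running",
    "Aug 24 18:21:22 node1 dlm_controld[348]: 1845218 clvmd wait for fencing",
    "Aug 24 18:21:24 node1 crmd[540]: notice: Stonith operation 6 for node2 failed (Timer expired): aborting transition.",
    "Aug 24 18:22:34 node1 stonith-ng[531]: notice: Operation 'reboot' [17416] (call 8 from crmd.540) for host 'node2' with device 'xvmfence' returned: 0 (OK)",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

If a node appears under both the failed and the succeeded results while a lockspace is still listed as waiting, the log is consistent with the sequence described in this article. The script only matches text; it does not query the live state of DLM or the cluster.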
Environment
- Red Hat Enterprise Linux (RHEL) 7 with the Resilient Storage Add On
- One or more applications or services using DLM. Qualifying situations include:
  - A controld resource is managed by the cluster
  - One or more GFS2 file systems are mounted in the cluster, possibly through a Filesystem resource managed by the cluster
- A stonith layout that may result in a fencing operation that fails one or more times before succeeding
  - This often occurs as a result of fencing operation timeouts, so environments with slow-to-respond stonith devices may be more at risk
