LVM commands and GFS2 file systems are blocked following a membership transition and fencing of a node in a RHEL 7 Resilient Storage cluster

Solution Unverified - Updated -

Issue

  • When the node running our resource that uses GFS2 is powered off, that resource fails to start on the new node in a timely manner.
  • GFS2 is blocked after a node gets fenced
  • When stonith fails once but then succeeds on the retry, GFS2 file systems and clvmd seem to remain blocked on the remaining node in the cluster until the fenced node rejoins
  • After fencing, we see messages from dlm_controld indicating errors, and dlm_tool ls reports "wait fencing" for some lockspaces:
Aug 24 18:21:22 node1 dlm_controld[348]: 1845218 fence wait 2 pid 16002 running
Aug 24 18:21:22 node1 dlm_controld[348]: 1845218 clvmd wait for fencing
[...]
Aug 24 18:21:24 node1 stonith-api[16002]: stonith_api_kick: Could not kick (reboot) node 2/(null) : Timer expired (-62)
Aug 24 18:21:24 node1 dlm_stonith[16002]: kick_helper error -62 nodeid 2
Aug 24 18:21:24 node1 crmd[540]:  notice: Stonith operation 6/89:6803:0:92a6022f-ae0d-47c3-bb72-9bfb9dd6bf51: Timer expired (-62)
Aug 24 18:21:24 node1 crmd[540]:  notice: Stonith operation 6 for node2 failed (Timer expired): aborting transition.
[...]
Aug 24 18:21:25 node1 dlm_controld[348]: 1845221 fence result 2 pid 16002 result 194 exit status
Aug 24 18:21:25 node1 dlm_controld[348]: 1845221 fence status 2 receive 194 from 1 walltime 1472077285 local 1845221
Aug 24 18:21:25 node1 dlm_controld[348]: 1845221 fence request 2 no actor
[...]
Aug 24 18:22:34 node1 stonith-ng[531]:  notice: Operation 'reboot' [17416] (call 8 from crmd.540) for host 'node2' with device 'xvmfence' returned: 0 (OK)
Aug 24 18:22:34 node1 stonith-ng[531]:  notice: Operation reboot of node2 by node1 for crmd.540@node1.81d4337b: OK
Aug 24 18:22:34 node1 crmd[540]:  notice: Stonith operation 8/89:6805:0:92a6022f-ae0d-47c3-bb72-9bfb9dd6bf51: OK (0)
Aug 24 18:22:34 node1 crmd[540]:  notice: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=81d4337b-b643-4305-9963-7b012a48d35a) by client crmd.540
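The pattern in the excerpt above (a lockspace waiting for fencing, a failed fence result recorded by dlm_controld, and a later successful stonith operation) can be picked out of a log with a short script. The following is only a minimal sketch in Python: the regular expressions are modelled solely on the lines quoted in this article and may not match other versions or configurations, and the function name summarize_fencing and the SAMPLE list are hypothetical names chosen for illustration.

```python
import re

# Patterns modelled only on the log lines quoted in this article (assumption:
# other dlm/pacemaker versions may format these messages differently).
DLM_WAIT = re.compile(r"dlm_controld\[\d+\]: \d+ (\S+) wait for fencing")
DLM_FENCE_RESULT = re.compile(
    r"dlm_controld\[\d+\]: \d+ fence result (\d+) pid \d+ result (\d+)"
)
STONITH_OK = re.compile(
    r"stonith-ng\[\d+\]:\s+notice: Operation (\w+) of (\S+) by (\S+) for \S+: OK"
)


def summarize_fencing(lines):
    """Collect fencing-related events from an iterable of syslog lines."""
    summary = {
        "waiting_lockspaces": set(),  # lockspaces that logged "wait for fencing"
        "dlm_fence_failures": [],     # (nodeid, result) where result is nonzero
        "stonith_successes": [],      # (target node name, action)
    }
    for line in lines:
        m = DLM_WAIT.search(line)
        if m:
            summary["waiting_lockspaces"].add(m.group(1))
            continue
        m = DLM_FENCE_RESULT.search(line)
        if m:
            if m.group(2) != "0":
                summary["dlm_fence_failures"].append(
                    (int(m.group(1)), int(m.group(2)))
                )
            continue
        m = STONITH_OK.search(line)
        if m:
            summary["stonith_successes"].append((m.group(2), m.group(1)))
    return summary


# Lines copied from the excerpt above.
SAMPLE = [
    "Aug 24 18:21:22 node1 dlm_controld[348]: 1845218 fence wait 2 pid 16002 running",
    "Aug 24 18:21:22 node1 dlm_controld[348]: 1845218 clvmd wait for fencing",
    "Aug 24 18:21:25 node1 dlm_controld[348]: 1845221 fence result 2 pid 16002 result 194 exit status",
    "Aug 24 18:21:25 node1 dlm_controld[348]: 1845221 fence request 2 no actor",
    "Aug 24 18:22:34 node1 stonith-ng[531]:  notice: Operation reboot of node2 by node1 for crmd.540@node1.81d4337b: OK",
]

if __name__ == "__main__":
    result = summarize_fencing(SAMPLE)
    for key, value in result.items():
        print(key, value)
```

If the summary shows both a dlm fence failure and a later stonith success for the same peer, the log is consistent with the situation described in this article. The script does not map the dlm node ID (2) to the node name (node2); that correspondence has to be checked against the cluster configuration.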

Environment

  • Red Hat Enterprise Linux (RHEL) 7 with the Resilient Storage Add On
  • One or more applications or services using DLM. Qualifying situations include:
    • A controld resource is managed by the cluster
    • One or more GFS2 file systems are mounted in the cluster, possibly through a Filesystem resource managed by the cluster
  • A stonith layout that may result in a fencing operation that fails one or more times before succeeding
    • This often occurs as a result of fencing operation timeouts, so environments with slow-to-respond stonith devices may be more at risk
