RHEL 7 High Availability and Resilient Storage Pacemaker cluster experiences a fence race condition between nodes during network outages while using fence_scsi with multipath storage

Solution Verified - Updated -

Issue

RHEL 7 Pacemaker cluster nodes using fence_scsi with multipath storage can race to fence one another during a network outage. Each node fences its peer on only some of the shared devices, so the fence operation fails on multiple nodes:

Node 1 fenced 2 of 3 shared devices, but failed to fence the 3rd:

Sep 28 17:25:03 node1 crmd[2654]:  notice: Requesting fencing (reboot) of node node2
Sep 28 17:25:03 node1 stonith-ng[2650]:  notice: Client crmd.2654.3f6ab593 wants to fence (reboot) 'node2' with device '(any)'
Sep 28 17:25:03 node1 stonith-ng[2650]:  notice: Requesting peer fencing (reboot) of node2
Sep 28 17:25:03 node1 stonith-ng[2650]:  notice: fence_scsi can fence (reboot) node2: static-list
Sep 28 17:25:03 node1 stonith-ng[2650]:  notice: fence_scsi can fence (reboot) node2: static-list
...
Sep 28 17:25:05 node1 fence_scsi: Failed to verify 1 device(s)
Sep 28 17:25:05 node1 stonith-ng[2650]: warning: fence_scsi[9357] stderr: [ WARNING:root:Parse error: Ignoring unknown option 'port=node2' ]
Sep 28 17:25:05 node1 stonith-ng[2650]: warning: fence_scsi[9357] stderr: [  ]
Sep 28 17:25:05 node1 stonith-ng[2650]: warning: fence_scsi[9357] stderr: [ ERROR:root:Failed to verify 1 device(s) ]
Sep 28 17:25:05 node1 stonith-ng[2650]: warning: fence_scsi[9357] stderr: [ Failed to verify 1 device(s) ]
Sep 28 17:25:05 node1 stonith-ng[2650]:   error: Operation 'reboot' [9357] (call 13 from crmd.2654) for host 'node2' with device 'fence_scsi' returned: -201 (Generic Pacemaker error)
Sep 28 17:25:05 node1 stonith-ng[2650]:  notice: Couldn't find anyone to fence (reboot) node2 with any device
Sep 28 17:25:05 node1 stonith-ng[2650]:   error: Operation reboot of node2 by <no-one> for crmd.2654@node1.fe23c10a: No route to host
Sep 28 17:25:05 node1 crmd[2654]:  notice: Stonith operation 13/88:12:0:376e05b2-17dc-4978-9e12-24373846add2: No route to host (-113)
Sep 28 17:25:05 node1 crmd[2654]:  notice: Stonith operation 13 for node2 failed (No route to host): aborting transition.
Sep 28 17:25:05 node1 crmd[2654]: warning: Too many failures (11) to fence node2, giving up
Sep 28 17:25:05 node1 crmd[2654]:  notice: Transition aborted: Stonith failed

Node 2 fenced the 3rd device, but failed to fence the other 2:

Sep 28 17:55:07 node2 crmd[14317]:  notice: Requesting fencing (reboot) of node node1
Sep 28 17:55:07 node2 stonith-ng[14313]:  notice: Client crmd.14317.56ab0693 wants to fence (reboot) 'node1' with device '(any)'
Sep 28 17:55:07 node2 stonith-ng[14313]:  notice: Requesting peer fencing (reboot) of node1
Sep 28 17:55:07 node2 stonith-ng[14313]:  notice: fence_scsi can fence (reboot) node1: static-list
Sep 28 17:55:07 node2 stonith-ng[14313]:  notice: fence_scsi can fence (reboot) node1: static-list
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: Agent 'fence_scsi' does not advertise support for 'reboot', performing 'off' action instead
Sep 28 17:55:07 node2 fence_scsi: Failed to verify 2 device(s)
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: fence_scsi[47186] stderr: [ WARNING:root:Parse error: Ignoring unknown option 'port=node1' ]
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: fence_scsi[47186] stderr: [  ]
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: fence_scsi[47186] stderr: [ ERROR:root:Failed to verify 2 device(s) ]
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: fence_scsi[47186] stderr: [ Failed to verify 2 device(s) ]
Sep 28 17:55:08 node2 fence_scsi: Failed to verify 2 device(s)
Sep 28 17:55:08 node2 stonith-ng[14313]: warning: fence_scsi[47261] stderr: [ WARNING:root:Parse error: Ignoring unknown option 'port=node1' ]
Sep 28 17:55:08 node2 stonith-ng[14313]: warning: fence_scsi[47261] stderr: [  ]
Sep 28 17:55:08 node2 stonith-ng[14313]: warning: fence_scsi[47261] stderr: [ ERROR:root:Failed to verify 2 device(s) ]
Sep 28 17:55:08 node2 stonith-ng[14313]: warning: fence_scsi[47261] stderr: [ Failed to verify 2 device(s) ]
Sep 28 17:55:08 node2 stonith-ng[14313]:   error: Operation 'reboot' [47261] (call 16 from crmd.14317) for host 'node1' with device 'fence_scsi' returned: -201 (Generic Pacemaker error)
Sep 28 17:55:09 node2 stonith-ng[14313]:  notice: Couldn't find anyone to fence (reboot) node1 with any device
Sep 28 17:55:09 node2 stonith-ng[14313]:   error: Operation reboot of node1 by <no-one> for crmd.14317@node2.3d1792b3: No route to host
Sep 28 17:55:09 node2 crmd[14317]:  notice: Stonith operation 16/33:17:0:c442aa93-5aa0-4a93-9a15-a2a606880900: No route to host (-113)
Sep 28 17:55:09 node2 crmd[14317]:  notice: Stonith operation 16 for node1 failed (No route to host): aborting transition.
Sep 28 17:55:09 node2 crmd[14317]: warning: Too many failures (12) to fence node1, giving up
Sep 28 17:55:09 node2 crmd[14317]:  notice: Transition aborted: Stonith failed
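
The split seen in the two logs above (node1 fails to verify 1 device, node2 fails to verify 2) can be reproduced with a small toy model. This is only an illustrative sketch, not the fence_scsi implementation: it assumes that a node whose registration key has already been removed from a device can no longer remove its peer's key there. The function name simulate_fence_race and the device names devA, devB, and devC are hypothetical.

```python
from collections import Counter


def simulate_fence_race(first_to_arrive):
    """Toy model of two nodes fencing each other at the same time.

    first_to_arrive maps each shared device to the node whose key-removal
    request reaches that device first (an assumption of this sketch).
    """
    nodes = {"node1", "node2"}
    # Every device starts with both nodes' keys registered.
    registered = {dev: set(nodes) for dev in first_to_arrive}
    failed = Counter()
    for dev, winner in first_to_arrive.items():
        loser = (nodes - {winner}).pop()
        # The winner removes the loser's key from this device.
        registered[dev].discard(loser)
        # The loser no longer holds a key here, so in this model its own
        # removal attempt is rejected and the winner's key stays registered.
        failed[loser] += 1
    return registered, dict(failed)


# node1 wins the race on two devices, node2 wins on the third.
registered, failed = simulate_fence_race(
    {"devA": "node1", "devB": "node1", "devC": "node2"}
)
print(failed)
```

In this model neither node ends up with its peer's key removed from every device, so neither fence operation can be verified as complete, which matches the "Failed to verify" errors on both nodes.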

Environment

  • Red Hat Enterprise Linux 7 with the High Availability or Resilient Storage Add-On (Pacemaker)
  • multipath
  • fence_scsi
