RHEL 7 High Availability and Resilient Storage Pacemaker cluster experiences a fence race condition between nodes during network outages while using fence_scsi with multipath storage

Solution Verified

Issue

RHEL 7 Pacemaker cluster nodes can experience a fence race condition when using fence_scsi with multipath storage. During a network outage, each node fences only some of its peer's shared devices, so the fence operation fails on multiple nodes:

Node 1 fenced 2 of 3 shared devices, but failed to fence the 3rd:

Sep 28 17:25:03 node1 crmd[2654]:  notice: Requesting fencing (reboot) of node node2
Sep 28 17:25:03 node1 stonith-ng[2650]:  notice: Client crmd.2654.3f6ab593 wants to fence (reboot) 'node2' with device '(any)'
Sep 28 17:25:03 node1 stonith-ng[2650]:  notice: Requesting peer fencing (reboot) of node2
Sep 28 17:25:03 node1 stonith-ng[2650]:  notice: fence_scsi can fence (reboot) node2: static-list
Sep 28 17:25:03 node1 stonith-ng[2650]:  notice: fence_scsi can fence (reboot) node2: static-list
...
Sep 28 17:25:05 node1 fence_scsi: Failed to verify 1 device(s)
Sep 28 17:25:05 node1 stonith-ng[2650]: warning: fence_scsi[9357] stderr: [ WARNING:root:Parse error: Ignoring unknown option 'port=node2' ]
Sep 28 17:25:05 node1 stonith-ng[2650]: warning: fence_scsi[9357] stderr: [  ]
Sep 28 17:25:05 node1 stonith-ng[2650]: warning: fence_scsi[9357] stderr: [ ERROR:root:Failed to verify 1 device(s) ]
Sep 28 17:25:05 node1 stonith-ng[2650]: warning: fence_scsi[9357] stderr: [ Failed to verify 1 device(s) ]
Sep 28 17:25:05 node1 stonith-ng[2650]:   error: Operation 'reboot' [9357] (call 13 from crmd.2654) for host 'node2' with device 'fence_scsi' returned: -201 (Generic Pacemaker error)
Sep 28 17:25:05 node1 stonith-ng[2650]:  notice: Couldn't find anyone to fence (reboot) node2 with any device
Sep 28 17:25:05 node1 stonith-ng[2650]:   error: Operation reboot of node2 by <no-one> for crmd.2654@node1.fe23c10a: No route to host
Sep 28 17:25:05 node1 crmd[2654]:  notice: Stonith operation 13/88:12:0:376e05b2-17dc-4978-9e12-24373846add2: No route to host (-113)
Sep 28 17:25:05 node1 crmd[2654]:  notice: Stonith operation 13 for node2 failed (No route to host): aborting transition.
Sep 28 17:25:05 node1 crmd[2654]: warning: Too many failures (11) to fence node2, giving up
Sep 28 17:25:05 node1 crmd[2654]:  notice: Transition aborted: Stonith failed

Node 2 fenced the 3rd device, but failed to fence the other 2:

Sep 28 17:55:07 node2 crmd[14317]:  notice: Requesting fencing (reboot) of node node1
Sep 28 17:55:07 node2 stonith-ng[14313]:  notice: Client crmd.14317.56ab0693 wants to fence (reboot) 'node1' with device '(any)'
Sep 28 17:55:07 node2 stonith-ng[14313]:  notice: Requesting peer fencing (reboot) of node1
Sep 28 17:55:07 node2 stonith-ng[14313]:  notice: fence_scsi can fence (reboot) node1: static-list
Sep 28 17:55:07 node2 stonith-ng[14313]:  notice: fence_scsi can fence (reboot) node1: static-list
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: Agent 'fence_scsi' does not advertise support for 'reboot', performing 'off' action instead
Sep 28 17:55:07 node2 fence_scsi: Failed to verify 2 device(s)
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: fence_scsi[47186] stderr: [ WARNING:root:Parse error: Ignoring unknown option 'port=node1' ]
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: fence_scsi[47186] stderr: [  ]
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: fence_scsi[47186] stderr: [ ERROR:root:Failed to verify 2 device(s) ]
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: fence_scsi[47186] stderr: [ Failed to verify 2 device(s) ]
Sep 28 17:55:08 node2 fence_scsi: Failed to verify 2 device(s)
Sep 28 17:55:08 node2 stonith-ng[14313]: warning: fence_scsi[47261] stderr: [ WARNING:root:Parse error: Ignoring unknown option 'port=node1' ]
Sep 28 17:55:08 node2 stonith-ng[14313]: warning: fence_scsi[47261] stderr: [  ]
Sep 28 17:55:08 node2 stonith-ng[14313]: warning: fence_scsi[47261] stderr: [ ERROR:root:Failed to verify 2 device(s) ]
Sep 28 17:55:08 node2 stonith-ng[14313]: warning: fence_scsi[47261] stderr: [ Failed to verify 2 device(s) ]
Sep 28 17:55:08 node2 stonith-ng[14313]:   error: Operation 'reboot' [47261] (call 16 from crmd.14317) for host 'node1' with device 'fence_scsi' returned: -201 (Generic Pacemaker error)
Sep 28 17:55:09 node2 stonith-ng[14313]:  notice: Couldn't find anyone to fence (reboot) node1 with any device
Sep 28 17:55:09 node2 stonith-ng[14313]:   error: Operation reboot of node1 by <no-one> for crmd.14317@node2.3d1792b3: No route to host
Sep 28 17:55:09 node2 crmd[14317]:  notice: Stonith operation 16/33:17:0:c442aa93-5aa0-4a93-9a15-a2a606880900: No route to host (-113)
Sep 28 17:55:09 node2 crmd[14317]:  notice: Stonith operation 16 for node1 failed (No route to host): aborting transition.
Sep 28 17:55:09 node2 crmd[14317]: warning: Too many failures (12) to fence node1, giving up
Sep 28 17:55:09 node2 crmd[14317]:  notice: Transition aborted: Stonith failed
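
The pattern in the two excerpts above (each node reporting that it could not verify some of the devices) can be picked out of collected logs with a short script. The following is only a sketch: it assumes the syslog line format shown in the excerpts, and the function names and the "two or more hosts" heuristic are illustrative, not part of fence_scsi or Pacemaker.

```python
import re
from collections import defaultdict

# Matches the syslog lines written by fence_scsi itself, for example:
# "Sep 28 17:25:05 node1 fence_scsi: Failed to verify 1 device(s)"
# The stonith-ng lines that echo the agent's stderr are deliberately not matched.
VERIFY_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+fence_scsi:\s+"
    r"Failed to verify (?P<count>\d+) device\(s\)"
)


def summarize_verify_failures(lines):
    """Return {host: [(timestamp, failed_device_count), ...]}."""
    summary = defaultdict(list)
    for line in lines:
        match = VERIFY_RE.match(line.strip())
        if match:
            summary[match["host"]].append((match["stamp"], int(match["count"])))
    return dict(summary)


def looks_like_fence_race(summary):
    """Heuristic (an assumption for illustration): two or more hosts
    each logged at least one partial verification failure."""
    return len(summary) >= 2


# Sample lines copied from the log excerpts in this article.
SAMPLE = [
    "Sep 28 17:25:05 node1 fence_scsi: Failed to verify 1 device(s)",
    "Sep 28 17:25:05 node1 stonith-ng[2650]: warning: fence_scsi[9357] stderr: [ Failed to verify 1 device(s) ]",
    "Sep 28 17:55:07 node2 fence_scsi: Failed to verify 2 device(s)",
    "Sep 28 17:55:08 node2 fence_scsi: Failed to verify 2 device(s)",
]

if __name__ == "__main__":
    result = summarize_verify_failures(SAMPLE)
    for host, events in sorted(result.items()):
        for stamp, count in events:
            print(f"{host} {stamp}: failed to verify {count} device(s)")
    print("possible fence race:", looks_like_fence_race(result))
```

In practice the input would be the combined log lines from all cluster nodes rather than the hard-coded sample. A match from this heuristic only suggests where to look and does not confirm the race by itself.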

Environment

  • Red Hat Enterprise Linux 7 with the High Availability Add-On (Pacemaker)
  • multipath
  • fence_scsi
