RHEL 7 High Availability and Resilient Storage Pacemaker cluster experiences a fence race condition between nodes during network outages while using fence_scsi with multipath storage
Issue
RHEL 7 Pacemaker cluster nodes can experience a fence race condition when using fence_scsi with multipath storage. Each node fences the other on only some of the shared devices, so fencing fails on multiple nodes:
Node 1 fenced 2 of 3 shared devices, but failed to fence the 3rd:
Sep 28 17:25:03 node1 crmd[2654]: notice: Requesting fencing (reboot) of node node2
Sep 28 17:25:03 node1 stonith-ng[2650]: notice: Client crmd.2654.3f6ab593 wants to fence (reboot) 'node2' with device '(any)'
Sep 28 17:25:03 node1 stonith-ng[2650]: notice: Requesting peer fencing (reboot) of node2
Sep 28 17:25:03 node1 stonith-ng[2650]: notice: fence_scsi can fence (reboot) node2: static-list
Sep 28 17:25:03 node1 stonith-ng[2650]: notice: fence_scsi can fence (reboot) node2: static-list
...
Sep 28 17:25:05 node1 fence_scsi: Failed to verify 1 device(s)
Sep 28 17:25:05 node1 stonith-ng[2650]: warning: fence_scsi[9357] stderr: [ WARNING:root:Parse error: Ignoring unknown option 'port=node2' ]
Sep 28 17:25:05 node1 stonith-ng[2650]: warning: fence_scsi[9357] stderr: [ ]
Sep 28 17:25:05 node1 stonith-ng[2650]: warning: fence_scsi[9357] stderr: [ ERROR:root:Failed to verify 1 device(s) ]
Sep 28 17:25:05 node1 stonith-ng[2650]: warning: fence_scsi[9357] stderr: [ Failed to verify 1 device(s) ]
Sep 28 17:25:05 node1 stonith-ng[2650]: error: Operation 'reboot' [9357] (call 13 from crmd.2654) for host 'node2' with device 'fence_scsi' returned: -201 (Generic Pacemaker error)
Sep 28 17:25:05 node1 stonith-ng[2650]: notice: Couldn't find anyone to fence (reboot) node2 with any device
Sep 28 17:25:05 node1 stonith-ng[2650]: error: Operation reboot of node2 by <no-one> for crmd.2654@node1.fe23c10a: No route to host
Sep 28 17:25:05 node1 crmd[2654]: notice: Stonith operation 13/88:12:0:376e05b2-17dc-4978-9e12-24373846add2: No route to host (-113)
Sep 28 17:25:05 node1 crmd[2654]: notice: Stonith operation 13 for node2 failed (No route to host): aborting transition.
Sep 28 17:25:05 node1 crmd[2654]: warning: Too many failures (11) to fence node2, giving up
Sep 28 17:25:05 node1 crmd[2654]: notice: Transition aborted: Stonith failed
Node 2 fenced the 3rd device, but failed to fence the other 2:
Sep 28 17:55:07 node2 crmd[14317]: notice: Requesting fencing (reboot) of node node1
Sep 28 17:55:07 node2 stonith-ng[14313]: notice: Client crmd.14317.56ab0693 wants to fence (reboot) 'node1' with device '(any)'
Sep 28 17:55:07 node2 stonith-ng[14313]: notice: Requesting peer fencing (reboot) of node1
Sep 28 17:55:07 node2 stonith-ng[14313]: notice: fence_scsi can fence (reboot) node1: static-list
Sep 28 17:55:07 node2 stonith-ng[14313]: notice: fence_scsi can fence (reboot) node1: static-list
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: Agent 'fence_scsi' does not advertise support for 'reboot', performing 'off' action instead
Sep 28 17:55:07 node2 fence_scsi: Failed to verify 2 device(s)
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: fence_scsi[47186] stderr: [ WARNING:root:Parse error: Ignoring unknown option 'port=node1' ]
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: fence_scsi[47186] stderr: [ ]
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: fence_scsi[47186] stderr: [ ERROR:root:Failed to verify 2 device(s) ]
Sep 28 17:55:07 node2 stonith-ng[14313]: warning: fence_scsi[47186] stderr: [ Failed to verify 2 device(s) ]
Sep 28 17:55:08 node2 fence_scsi: Failed to verify 2 device(s)
Sep 28 17:55:08 node2 stonith-ng[14313]: warning: fence_scsi[47261] stderr: [ WARNING:root:Parse error: Ignoring unknown option 'port=node1' ]
Sep 28 17:55:08 node2 stonith-ng[14313]: warning: fence_scsi[47261] stderr: [ ]
Sep 28 17:55:08 node2 stonith-ng[14313]: warning: fence_scsi[47261] stderr: [ ERROR:root:Failed to verify 2 device(s) ]
Sep 28 17:55:08 node2 stonith-ng[14313]: warning: fence_scsi[47261] stderr: [ Failed to verify 2 device(s) ]
Sep 28 17:55:08 node2 stonith-ng[14313]: error: Operation 'reboot' [47261] (call 16 from crmd.14317) for host 'node1' with device 'fence_scsi' returned: -201 (Generic Pacemaker error)
Sep 28 17:55:09 node2 stonith-ng[14313]: notice: Couldn't find anyone to fence (reboot) node1 with any device
Sep 28 17:55:09 node2 stonith-ng[14313]: error: Operation reboot of node1 by <no-one> for crmd.14317@node2.3d1792b3: No route to host
Sep 28 17:55:09 node2 crmd[14317]: notice: Stonith operation 16/33:17:0:c442aa93-5aa0-4a93-9a15-a2a606880900: No route to host (-113)
Sep 28 17:55:09 node2 crmd[14317]: notice: Stonith operation 16 for node1 failed (No route to host): aborting transition.
Sep 28 17:55:09 node2 crmd[14317]: warning: Too many failures (12) to fence node1, giving up
Sep 28 17:55:09 node2 crmd[14317]: notice: Transition aborted: Stonith failed
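The pattern in the excerpts above (each node reporting "Failed to verify N device(s)" against its peer) can be spotted across collected logs with a small script. The following is a minimal sketch, not a Red Hat tool: the function names `summarize_verify_failures` and `looks_like_fence_race` are hypothetical, the regular expression assumes the syslog format shown above, and treating "two or more hosts reported unverified devices" as a race indicator is an assumption made for illustration.

```python
import re
from collections import defaultdict

# Matches only the line that fence_scsi itself logs, for example:
#   "Sep 28 17:25:05 node1 fence_scsi: Failed to verify 1 device(s)"
# The stonith-ng "stderr: [ ... ]" echoes of the same message are skipped,
# so each failed attempt is counted once.
VERIFY_FAILURE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+fence_scsi:\s+"
    r"Failed to verify (?P<count>\d+) device\(s\)"
)


def summarize_verify_failures(lines):
    """Return {host: [(timestamp, unverified_device_count), ...]}."""
    summary = defaultdict(list)
    for line in lines:
        match = VERIFY_FAILURE.match(line.strip())
        if match:
            summary[match["host"]].append(
                (match["stamp"], int(match["count"]))
            )
    return dict(summary)


def looks_like_fence_race(summary):
    """Assumed heuristic: two or more hosts each failed to verify devices."""
    return len(summary) >= 2


if __name__ == "__main__":
    # Sample lines copied from the log excerpts in this article.
    sample = [
        "Sep 28 17:25:05 node1 fence_scsi: Failed to verify 1 device(s)",
        "Sep 28 17:55:07 node2 fence_scsi: Failed to verify 2 device(s)",
        "Sep 28 17:55:08 node2 fence_scsi: Failed to verify 2 device(s)",
    ]
    report = summarize_verify_failures(sample)
    for host, failures in sorted(report.items()):
        for stamp, count in failures:
            print(f"{host} {stamp}: {count} device(s) not verified")
    print("possible fence race:", looks_like_fence_race(report))
```

To use it against real data, pass the lines of /var/log/messages from every cluster node to `summarize_verify_failures`. A positive result only suggests where to look; it does not confirm the race on its own.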
Environment
- Red Hat Enterprise Linux 7 with the High Availability or Resilient Storage Add-On (Pacemaker)
- multipath
- fence_scsi