fence_scsi_check.pl watchdog script does a soft reboot instead of hard and hangs during shutdown in a RHEL 6 or 7 Resilient Storage cluster with device-mapper-multipath

Solution Unverified - Updated 2024-08-05T07:54:09+00:00

Issue

After manually fencing a node that is actively running a resource group, the SCSI watchdog begins to initiate a reboot but fails to completely reboot the machine.
When the watchdog reboots a node, it gets stuck shutting down. I see backtraces showing it waiting on device mapper or the file system.

Environment

Red Hat Enterprise Linux (RHEL) 6 or 7 with the High Availability Add-On
Using SCSI Persistent Reservation Fencing (fence_scsi)
Using the fence_scsi_check.pl watchdog script for fence_scsi to reboot a node when fenced
- RHEL 7:
  - Using a fence-agents-scsi release prior to 4.0.11-27.el7_2.5, OR
  - Using fence-agents-scsi-4.0.11-27.el7_2.5 or later AND /etc/watchdog.d/fence_scsi_check is in place (as opposed to /etc/watchdog.d/fence_scsi_check_hardreboot)
- RHEL 6:
  - Using a fence-agents release prior to 3.1.5-48.el6, OR
  - Using fence-agents-3.1.5-48.el6 or later AND /usr/share/cluster/fence_scsi_check.pl is linked or copied to /etc/watchdog.d (as opposed to /usr/share/cluster/fence_scsi_check_hardreboot.pl being linked or copied)
device-mapper-multipath
- The settings for the device in question enable queueing (even if only temporarily) when all paths have failed
  - Queueing can be enabled by setting no_path_retry to "queue" or to a value greater than 0 in /etc/multipath.conf, or through the built-in device settings in multipathd (see /usr/share/doc/device-mapper-multipath-$vers/multipath.conf.defaults)
  - Queueing can also be enabled by features "1 queue_if_no_path" in /etc/multipath.conf or in the built-in device settings in multipathd, provided no_path_retry is not set.
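
The multipath conditions above can be sketched as a small decision function. This is only an illustrative sketch, not part of fence-agents or device-mapper-multipath: the function name queueing_enabled and its arguments are hypothetical, and it assumes the effective no_path_retry and features values for the device have already been gathered by hand (for example from /etc/multipath.conf or the built-in defaults). It also assumes that no_path_retry set to "fail" or 0 disables queueing, as described in the multipath.conf man page.

```python
from typing import Optional


def queueing_enabled(no_path_retry: Optional[str], features: Optional[str]) -> bool:
    """Return True if the given settings allow queueing when all paths fail.

    Both arguments are the raw values as written in the configuration,
    or None when the option is not set for the device.
    """
    if no_path_retry is not None:
        value = no_path_retry.strip().strip('"').lower()
        if value == "queue":
            return True   # queue indefinitely
        if value == "fail":
            return False  # assumption: "fail" means no queueing at all
        if value.isdigit():
            # A positive retry count still queues temporarily.
            return int(value) > 0
        raise ValueError("unrecognized no_path_retry value: %r" % no_path_retry)

    # no_path_retry is not set, so the features string decides.
    if features is None:
        return False
    return "queue_if_no_path" in features.strip().strip('"').split()
```

For example, queueing_enabled("5", None) and queueing_enabled(None, "1 queue_if_no_path") both report that queueing is enabled, which matches the environment this issue applies to, while queueing_enabled("fail", "1 queue_if_no_path") does not.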
