fence_scsi_check.pl watchdog script does a soft reboot instead of hard and hangs during shutdown in a RHEL 6 or 7 Resilient Storage cluster with device-mapper-multipath
Issue
- After manually fencing a node that is actively running a resource group, the SCSI watchdog begins to initiate a reboot but fails to reboot the machine completely.
- When the watchdog reboots a node, it gets stuck shutting down. I see backtraces showing it waiting on device mapper or the file system.
Environment
- Red Hat Enterprise Linux (RHEL) 6 or 7 with the High Availability Add On
- Using SCSI Persistent Reservation Fencing (fence_scsi)
- Using the fence_scsi_check.pl watchdog script for fence_scsi to reboot a node when fenced
  - RHEL 7:
    - Using a fence-agents-scsi release prior to 4.0.11-27.el7_2.5, OR
    - Using fence-agents-scsi-4.0.11-27.el7_2.5 or later AND /etc/watchdog.d/fence_scsi_check is in place (as opposed to /etc/watchdog.d/fence_scsi_check_hardreboot)
  - RHEL 6:
    - Using a fence-agents release prior to 3.1.5-48.el6, OR
    - Using fence-agents-3.1.5-48.el6 or later AND /usr/share/cluster/fence_scsi_check.pl is linked or copied to /etc/watchdog.d (as opposed to /usr/share/cluster/fence_scsi_check_hardreboot.pl being linked or copied)
- device-mapper-multipath
  - The settings for the device in question enable queueing (even if only temporary) when all paths have failed
    - Can be enabled via no_path_retry set to "queue" or a value greater than 0 in /etc/multipath.conf, or in the built-in device settings in multipathd (see /usr/share/doc/device-mapper-multipath-$vers/multipath.conf.defaults)
    - Can be enabled via features "1 queue_if_no_path" in /etc/multipath.conf or in the built-in device settings in multipathd, if no_path_retry is not set
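To help judge whether a given /etc/multipath.conf matches the queueing conditions listed above, the following is a minimal sketch rather than a supported tool. The function name queueing_settings and the sample configuration are hypothetical. It only inspects the text it is given, so it does not see the built-in device settings in multipathd. It also assumes "#" begins a comment and reports every matching line, without checking whether no_path_retry overrides a features setting.

```python
import re


def queueing_settings(conf_text):
    """Return the lines in multipath.conf-style text that would enable queueing.

    Hypothetical helper: flags no_path_retry set to "queue" or to a number
    greater than 0, and any features string containing queue_if_no_path.
    """
    hits = []
    for raw in conf_text.splitlines():
        # Assumption: '#' begins a comment; drop it and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        match = re.match(r'no_path_retry\s+"?(\w+)"?', line)
        if match:
            value = match.group(1)
            if value == "queue" or (value.isdigit() and int(value) > 0):
                hits.append(line)
            continue
        if re.match(r'features\s+".*\bqueue_if_no_path\b.*"', line):
            hits.append(line)
    return hits


# Illustrative input only; not taken from any real system.
example = """
defaults {
    no_path_retry 12   # queue for 12 retries, then fail
}
devices {
    device {
        features "1 queue_if_no_path"
    }
}
"""

for setting in queueing_settings(example):
    print("queueing enabled by:", setting)
```

An empty result for the contents of /etc/multipath.conf does not rule out queueing, because the built-in device settings may still enable it.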