fence_scsi_check.pl watchdog script does a soft reboot instead of hard and hangs during shutdown in a RHEL 6 or 7 Resilient Storage cluster with device-mapper-multipath

Solution Unverified - Updated -

Issue

  • After manually fencing a node that is actively running a resource group, the SCSI watchdog begins to initiate a reboot but fails to completely reboot the machine.
  • When the watchdog reboots a node, it gets stuck shutting down. I see backtraces showing it waiting on device mapper or the file system.

Environment

  • Red Hat Enterprise Linux (RHEL) 6 or 7 with the High Availability Add-On
  • Using SCSI Persistent Reservation Fencing (fence_scsi)
  • Using the fence_scsi_check.pl watchdog script for fence_scsi to reboot a node when fenced
    • RHEL 7:
    • RHEL 6:
      • Using a fence-agents release prior to 3.1.5-48.el6, OR
      • Using fence-agents-3.1.5-48.el6 or later AND /usr/share/cluster/fence_scsi_check.pl is linked or copied to /etc/watchdog.d (as opposed to /usr/share/cluster/fence_scsi_check_hardreboot.pl being linked or copied)
  • device-mapper-multipath
    • The settings for the device in question enable queueing (even if only temporary) when all paths have failed
      • Can be enabled via no_path_retry set to "queue" or a value greater than 0 in /etc/multipath.conf, or in the built-in device settings in multipathd (see /usr/share/doc/device-mapper-multipath-$vers/multipath.conf.defaults)
      • Can be enabled via features "1 queue_if_no_path" in /etc/multipath.conf or built-in device settings in multipathd if no_path_retry is not set.
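
To check whether a given configuration falls under the multipath conditions above, a rough sketch along the following lines may help. It scans multipath.conf-style text for no_path_retry set to "queue" or a value greater than 0, and for a features string containing queue_if_no_path. The function name, the sample vendor strings, and the simplified line-by-line parsing are all assumptions made for illustration. It does not read the built-in multipathd device settings, and it flags a features line even when no_path_retry is also set, so treat a hit as a prompt to inspect the configuration rather than a definitive answer.

```python
import re

def queueing_settings(conf_text):
    """Return (line_number, setting) pairs that appear to enable queueing."""
    hits = []
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        m = re.match(r'no_path_retry\s+"?(\w+)"?$', line)
        if m:
            value = m.group(1)
            # "queue" queues indefinitely; N > 0 queues for N retries
            if value == "queue" or (value.isdigit() and int(value) > 0):
                hits.append((lineno, line))
            continue
        if re.match(r'features\s+"[^"]*\bqueue_if_no_path\b[^"]*"$', line):
            hits.append((lineno, line))
    return hits

# Hypothetical configuration text; vendor names are placeholders.
SAMPLE = '''
defaults {
    no_path_retry fail
}
devices {
    device {
        vendor "EXAMPLE"
        no_path_retry 12   # retry 12 times, then fail
    }
    device {
        vendor "OTHER"
        features "1 queue_if_no_path"
    }
}
'''

if __name__ == "__main__":
    for lineno, setting in queueing_settings(SAMPLE):
        print(f"line {lineno}: {setting}")
```

In practice the text would come from /etc/multipath.conf, for example via `open("/etc/multipath.conf").read()`, with the built-in defaults reviewed separately.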
