fence_scsi_check.pl watchdog script does a soft reboot instead of hard and hangs during shutdown in a RHEL 6 or 7 Resilient Storage cluster with device-mapper-multipath

Solution Unverified - Updated -

Issue

  • After manually fencing a node that is actively running a resource group, the SCSI watchdog begins to initiate a reboot but fails to reboot the machine completely.
  • When the watchdog reboots a node, it gets stuck shutting down. I see backtraces showing it waiting on device mapper or the file system.

Environment

  • Red Hat Enterprise Linux (RHEL) 6 or 7 with the High Availability Add-On
  • Using SCSI Persistent Reservation Fencing (fence_scsi)
  • Using the fence_scsi_check.pl watchdog script for fence_scsi to reboot a node when fenced
    • RHEL 7:
    • RHEL 6:
      • Using a fence-agents release prior to 3.1.5-48.el6, OR
      • Using fence-agents-3.1.5-48.el6 or later AND /usr/share/cluster/fence_scsi_check.pl is linked or copied to /etc/watchdog.d (as opposed to /usr/share/cluster/fence_scsi_check_hardreboot.pl being linked or copied)
  • device-mapper-multipath
    • The settings for the device in question enable queueing (even if only temporarily) when all paths have failed
      • Can be enabled via no_path_retry set to "queue" or a value greater than 0 in /etc/multipath.conf, or in the built-in device settings in multipathd (see /usr/share/doc/device-mapper-multipath-$vers/multipath.conf.defaults)
      • Can be enabled via features "1 queue_if_no_path" in /etc/multipath.conf or built-in device settings in multipathd if no_path_retry is not set.
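
To illustrate the two multipath settings listed above, here is a minimal Python sketch that scans multipath.conf-style text and reports lines that would enable queueing. The function name `queueing_settings` and the sample vendor and product strings are hypothetical and used only for illustration. The sketch makes simplifying assumptions: it only inspects the text it is given, does not consult the built-in device settings in multipathd, and does not resolve precedence between sections or between no_path_retry and features.

```python
import re


def queueing_settings(conf_text):
    """Return the lines of multipath.conf-style text that may enable queueing.

    Simplified: every matching line is reported, regardless of which
    section it appears in or whether another setting overrides it.
    """
    hits = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop '#' comments and whitespace
        if not line:
            continue
        m = re.match(r'no_path_retry\s+"?(\w+)"?$', line)
        if m:
            value = m.group(1)
            # "queue" or any value greater than 0 enables queueing
            if value == "queue" or (value.isdigit() and int(value) > 0):
                hits.append(line)
            continue
        # features "1 queue_if_no_path" style setting
        if re.match(r'features\s+"[^"]*\bqueue_if_no_path\b[^"]*"$', line):
            hits.append(line)
    return hits


# Hypothetical configuration text, for illustration only
sample = '''
defaults {
    no_path_retry fail
}
devices {
    device {
        vendor "EXAMPLE"
        product "ARRAY"
        no_path_retry 12
    }
    device {
        vendor "OTHER"
        features "1 queue_if_no_path"
    }
}
'''

print(queueing_settings(sample))
# prints ['no_path_retry 12', 'features "1 queue_if_no_path"']
```

An empty result from this sketch does not prove that queueing is disabled, because the effective settings may come from the built-in device defaults rather than from /etc/multipath.conf.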
