SCSI reservation fencing hangs with Hitachi SAN and device-mapper-multipath on Red Hat Enterprise Linux Clusters

Issue

  • Why did fence_scsi not complete on node2 when racing with node1, allowing the nodes to try to form a new cluster and kill each other? Log example:

    • Node1:
    Apr  9 17:01:52 node1 corosync[44112]:   [TOTEM ] A processor failed, forming new configuration.
    Apr  9 17:01:54 node1 corosync[44112]:   [QUORUM] Members[1]: 1
    Apr  9 17:01:54 node1 corosync[44112]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Apr  9 17:01:54 node1 corosync[44112]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.10.10) r(1) ip(100.254.180.51) ; members(old:2 left:1)
    Apr  9 17:01:54 node1 rgmanager[44854]: State change: node2 DOWN
    Apr  9 17:01:54 node1 corosync[44112]:   [MAIN  ] Completed service synchronization, ready to provide service.
    Apr  9 17:01:54 node1 fenced[44186]: fencing node node2
    Apr  9 17:01:55 node1 fenced[44186]: fence node2 success
    
    • Node2 tries to fence node1 but loses the race; it hits SCSI reservation errors instead and fencing does not complete (a way to inspect the remaining reservations is sketched after these logs):
    Apr  9 17:01:52 node2 corosync[34337]:   [TOTEM ] A processor failed, forming new configuration.
    Apr  9 17:01:54 node2 corosync[34337]:   [QUORUM] Members[1]: 2
    Apr  9 17:01:54 node2 corosync[34337]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Apr  9 17:01:54 node2 corosync[34337]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.10.11) r(1) ip(100.254.180.52) ; members(old:2 left:1)
    Apr  9 17:01:54 node2 rgmanager[35045]: State change: node1 DOWN
    Apr  9 17:01:54 node2 corosync[34337]:   [MAIN  ] Completed service synchronization, ready to provide service.
    Apr  9 17:01:54 node2 kernel: dlm: closing connection to node 1
    Apr  9 17:01:54 node2 fenced[34398]: fencing node node1
    ...
    Apr  9 17:02:02 node2 kernel: sd 1:0:0:16: reservation conflict
    Apr  9 17:02:12 node2 kernel: sd 4:0:0:17: reservation conflict
    Apr  9 17:02:12 node2 kernel: sd 1:0:0:18: reservation conflict
    Apr  9 17:02:12 node2 kernel: sd 4:0:0:19: reservation conflict
    Apr  9 17:02:12 node2 kernel: sd 1:0:0:20: reservation conflict
    Apr  9 17:02:13 node2 multipathd: 66:48: mark as failed
    Apr  9 17:02:13 node2 multipathd: exts1: remaining active paths: 1
    ...
    Apr  9 17:01:55 node2 corosync[34337]:   [TOTEM ] Automatically recovered ring 1
    Apr  9 17:02:21 node2 corosync[34337]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Apr  9 17:02:21 node2 corosync[34337]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.10.11) r(1) ip(100.254.180.52) ; members(old:1 left:0)
    Apr  9 17:02:21 node2 corosync[34337]:   [MAIN  ] Completed service synchronization, ready to provide service.
    
    • Node1 then tries to kill node2 for rejoining the cluster with existing state, and ends up getting killed itself:
    Apr  9 17:02:40 node1 corosync[44112]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Apr  9 17:02:40 node1 corosync[44112]:   [QUORUM] Members[2]: 1 2
    Apr  9 17:02:40 node1 corosync[44112]:   [QUORUM] Members[2]: 1 2
    Apr  9 17:02:40 node1 corosync[44112]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.10.10) r(1) ip(100.254.180.51) ; members(old:1 left:0)
    Apr  9 17:02:40 node1 corosync[44112]:   [MAIN  ] Completed service synchronization, ready to provide service.
    Apr  9 17:02:40 node1 fenced[44186]: telling cman to remove nodeid 2 from cluster
    Apr  9 17:02:40 node1 corosync[44112]: cman killed by node 2 because we were killed by cman_tool or other application
    Apr  9 17:02:40 node1 fenced[44186]: cluster is down, exiting
    
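The reservation conflicts on node2 are typically expected once it has lost the fence race: a successful fence_scsi action removes the victim's registration key from the shared devices, after which the storage rejects that node's I/O. To inspect which registration keys and which reservation are still present on a shared device after such a race, sg_persist from the sg3_utils package can be used; the multipath device name below is only an example and should be replaced with one of the devices fence_scsi is configured to use:

    # sg_persist --in --read-keys --device=/dev/mapper/mpathb
    # sg_persist --in --read-reservation --device=/dev/mapper/mpathb

The first command lists the keys currently registered on the device, and the second shows the active reservation and its holder.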

Environment

  • Red Hat Enterprise Linux Server 6 (with the High Availability or Resilient Storage Add Ons)
  • Red Hat High Availability cluster with 2 or more nodes

  • Device-mapper-multipath configured with Hitachi DF600F model SAN:

    • SAN devices are vendor "HITACHI", model "DF600F":
    # cat /proc/scsi/scsi
    Host: scsi1 Channel: 00 Id: 00 Lun: 00
      Vendor: HITACHI  Model: DF600F           Rev: 0000
      Type:   Direct-Access                    ANSI  SCSI revision: 04
    Host: scsi4 Channel: 00 Id: 00 Lun: 01
      Vendor: HITACHI  Model: DF600F           Rev: 0000
    
    • Specifically, the directio path checker is in use (this is not the default; see the configuration sketch after this list):
    # grep path_checker /etc/multipath.conf
    path_checker directio   
    
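For reference, the path_checker used for a given array is set either in the defaults section or in a device-specific stanza of /etc/multipath.conf. The snippet below is only a sketch of where such a per-device setting would live for these arrays: the vendor and product strings mirror the output above, but the tur value shown is an illustrative assumption, not necessarily the documented resolution for this issue:

    devices {
        device {
            vendor        "HITACHI"
            product       "DF600F"
            path_checker  tur
        }
    }

After any change to /etc/multipath.conf, multipathd has to re-read its configuration (for example, by reloading the multipathd service) before the new checker takes effect.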
