SCSI reservation fencing hangs with Hitachi SAN and device-mapper-multipath on Red Hat Enterprise Linux Clusters

Issue

  • Why did fence_scsi not complete on node2 when racing with node1, allowing the nodes to try to form a new cluster and kill each other? Log example:

    • Node1:
    Apr  9 17:01:52 node1 corosync[44112]:   [TOTEM ] A processor failed, forming new configuration.
    Apr  9 17:01:54 node1 corosync[44112]:   [QUORUM] Members[1]: 1
    Apr  9 17:01:54 node1 corosync[44112]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Apr  9 17:01:54 node1 corosync[44112]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.10.10) r(1) ip(100.254.180.51) ; members(old:2 left:1)
    Apr  9 17:01:54 node1 rgmanager[44854]: State change: node2 DOWN
    Apr  9 17:01:54 node1 corosync[44112]:   [MAIN  ] Completed service synchronization, ready to provide service.
    Apr  9 17:01:54 node1 fenced[44186]: fencing node node2
    Apr  9 17:01:55 node1 fenced[44186]: fence node2 success
    
    • Node2 tries to fence node1 but loses the race; it hits SCSI reservation errors instead and fencing does not complete (a way to inspect the remaining reservations is sketched after these logs):
    Apr  9 17:01:52 node2 corosync[34337]:   [TOTEM ] A processor failed, forming new configuration.
    Apr  9 17:01:54 node2 corosync[34337]:   [QUORUM] Members[1]: 2
    Apr  9 17:01:54 node2 corosync[34337]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Apr  9 17:01:54 node2 corosync[34337]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.10.11) r(1) ip(100.254.180.52) ; members(old:2 left:1)
    Apr  9 17:01:54 node2 rgmanager[35045]: State change: node1 DOWN
    Apr  9 17:01:54 node2 corosync[34337]:   [MAIN  ] Completed service synchronization, ready to provide service.
    Apr  9 17:01:54 node2 kernel: dlm: closing connection to node 1
    Apr  9 17:01:54 node2 fenced[34398]: fencing node node1
    ...
    Apr  9 17:02:02 node2 kernel: sd 1:0:0:16: reservation conflict
    Apr  9 17:02:12 node2 kernel: sd 4:0:0:17: reservation conflict
    Apr  9 17:02:12 node2 kernel: sd 1:0:0:18: reservation conflict
    Apr  9 17:02:12 node2 kernel: sd 4:0:0:19: reservation conflict
    Apr  9 17:02:12 node2 kernel: sd 1:0:0:20: reservation conflict
    Apr  9 17:02:13 node2 multipathd: 66:48: mark as failed
    Apr  9 17:02:13 node2 multipathd: exts1: remaining active paths: 1
    ...
    Apr  9 17:01:55 node2 corosync[34337]:   [TOTEM ] Automatically recovered ring 1
    Apr  9 17:02:21 node2 corosync[34337]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Apr  9 17:02:21 node2 corosync[34337]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.10.11) r(1) ip(100.254.180.52) ; members(old:1 left:0)
    Apr  9 17:02:21 node2 corosync[34337]:   [MAIN  ] Completed service synchronization, ready to provide service.
    
    • Node1 then tries to kill node2 for rejoining the cluster with existing state, and ends up getting killed itself:
    Apr  9 17:02:40 node1 corosync[44112]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Apr  9 17:02:40 node1 corosync[44112]:   [QUORUM] Members[2]: 1 2
    Apr  9 17:02:40 node1 corosync[44112]:   [QUORUM] Members[2]: 1 2
    Apr  9 17:02:40 node1 corosync[44112]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.10.10) r(1) ip(100.254.180.51) ; members(old:1 left:0)
    Apr  9 17:02:40 node1 corosync[44112]:   [MAIN  ] Completed service synchronization, ready to provide service.
    Apr  9 17:02:40 node1 fenced[44186]: telling cman to remove nodeid 2 from cluster
    Apr  9 17:02:40 node1 corosync[44112]: cman killed by node 2 because we were killed by cman_tool or other application
    Apr  9 17:02:40 node1 fenced[44186]: cluster is down, exiting
    
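The reservation conflicts on node2 are typically expected once it has lost the fence race: a successful fence_scsi action removes the victim's registration key from the shared devices, after which the storage rejects that node's I/O. To inspect which registration keys and which reservation are still present on a shared device after such a race, sg_persist from the sg3_utils package can be used; the multipath device name below is only an example and should be replaced with one of the devices fence_scsi is configured to use:

    # sg_persist --in --read-keys --device=/dev/mapper/mpathb
    # sg_persist --in --read-reservation --device=/dev/mapper/mpathb

The first command lists the keys currently registered on the device, and the second shows the active reservation and its holder.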

Environment

  • Red Hat Enterprise Linux Server 6 (with the High Availability or Resilient Storage Add Ons)
  • Red Hat High Availability cluster with 2 or more nodes

  • Device-mapper-multipath configured with Hitachi DF600F model SAN:

    • SAN devices are vendor "HITACHI", model "DF600F":
    # cat /proc/scsi/scsi
    Host: scsi1 Channel: 00 Id: 00 Lun: 00
      Vendor: HITACHI  Model: DF600F           Rev: 0000
      Type:   Direct-Access                    ANSI  SCSI revision: 04
    Host: scsi4 Channel: 00 Id: 00 Lun: 01
      Vendor: HITACHI  Model: DF600F           Rev: 0000
    
    • Specifically, the directio path checker is in use (this is not the default; see the configuration sketch after this list):
    # grep path_checker /etc/multipath.conf
    path_checker directio   
    
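For reference, the path_checker used for a given array is set either in the defaults section or in a device-specific stanza of /etc/multipath.conf. The snippet below is only a sketch of where such a per-device setting would live for these arrays: the vendor and product strings mirror the output above, but the tur value shown is an illustrative assumption, not necessarily the documented resolution for this issue:

    devices {
        device {
            vendor        "HITACHI"
            product       "DF600F"
            path_checker  tur
        }
    }

After any change to /etc/multipath.conf, multipathd has to re-read its configuration (for example, by reloading the multipathd service) before the new checker takes effect.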
