SCSI reservation fencing hangs with Hitachi SAN and device-mapper-multipath on Red Hat Enterprise Linux Clusters
Issue
- Why did fence_scsi not complete on node2 when it raced with node1, allowing the nodes to later form a new cluster and kill each other? Log example:
- Node1:
Apr 9 17:01:52 node1 corosync[44112]: [TOTEM ] A processor failed, forming new configuration.
Apr 9 17:01:54 node1 corosync[44112]: [QUORUM] Members[1]: 1
Apr 9 17:01:54 node1 corosync[44112]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 9 17:01:54 node1 corosync[44112]: [CPG ] chosen downlist: sender r(0) ip(10.10.10.10) r(1) ip(100.254.180.51) ; members(old:2 left:1)
Apr 9 17:01:54 node1 rgmanager[44854]: State change: node2 DOWN
Apr 9 17:01:54 node1 corosync[44112]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 9 17:01:54 node1 fenced[44186]: fencing node node2
Apr 9 17:01:55 node1 fenced[44186]: fence node2 success
- Node2 tries to fence node1 but loses the race; instead it hits SCSI reservation conflicts and its fencing operation never completes:
Apr 9 17:01:52 node2 corosync[34337]: [TOTEM ] A processor failed, forming new configuration.
Apr 9 17:01:54 node2 corosync[34337]: [QUORUM] Members[1]: 2
Apr 9 17:01:54 node2 corosync[34337]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 9 17:01:54 node2 corosync[34337]: [CPG ] chosen downlist: sender r(0) ip(10.10.10.11) r(1) ip(100.254.180.52) ; members(old:2 left:1)
Apr 9 17:01:54 node2 rgmanager[35045]: State change: node1 DOWN
Apr 9 17:01:54 node2 corosync[34337]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 9 17:01:54 node2 kernel: dlm: closing connection to node 1
Apr 9 17:01:54 node2 fenced[34398]: fencing node node1
...
Apr 9 17:02:02 node2 kernel: sd 1:0:0:16: reservation conflict
Apr 9 17:02:12 node2 kernel: sd 4:0:0:17: reservation conflict
Apr 9 17:02:12 node2 kernel: sd 1:0:0:18: reservation conflict
Apr 9 17:02:12 node2 kernel: sd 4:0:0:19: reservation conflict
Apr 9 17:02:12 node2 kernel: sd 1:0:0:20: reservation conflict
Apr 9 17:02:13 node2 multipathd: 66:48: mark as failed
Apr 9 17:02:13 node2 multipathd: exts1: remaining active paths: 1
...
Apr 9 17:01:55 node2 corosync[34337]: [TOTEM ] Automatically recovered ring 1
Apr 9 17:02:21 node2 corosync[34337]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 9 17:02:21 node2 corosync[34337]: [CPG ] chosen downlist: sender r(0) ip(10.10.10.11) r(1) ip(100.254.180.52) ; members(old:1 left:0)
Apr 9 17:02:21 node2 corosync[34337]: [MAIN ] Completed service synchronization, ready to provide service.
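fence_scsi is built on SCSI-3 persistent reservations, so when diagnosing reservation conflicts like the ones above it can help to inspect the registrations and current reservation on a shared device. A sketch using sg_persist from the sg3_utils package (the multipath device name below is an example, not from this cluster):

```shell
# List the registration keys held on the shared device
sg_persist --in --read-keys /dev/mapper/mpatha

# Show the current reservation holder and reservation type
sg_persist --in --read-reservation /dev/mapper/mpatha
```

A node whose key is no longer registered (because the surviving node preempted it) will see "reservation conflict" on writes to that device, which matches the kernel messages logged on node2.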
- Node1 tries to kill node2 for joining with existing state, and ends up killing itself:
Apr 9 17:02:40 node1 corosync[44112]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 9 17:02:40 node1 corosync[44112]: [QUORUM] Members[2]: 1 2
Apr 9 17:02:40 node1 corosync[44112]: [QUORUM] Members[2]: 1 2
Apr 9 17:02:40 node1 corosync[44112]: [CPG ] chosen downlist: sender r(0) ip(10.10.10.10) r(1) ip(100.254.180.51) ; members(old:1 left:0)
Apr 9 17:02:40 node1 corosync[44112]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 9 17:02:40 node1 fenced[44186]: telling cman to remove nodeid 2 from cluster
Apr 9 17:02:40 node1 corosync[44112]: cman killed by node 2 because we were killed by cman_tool or other application
Apr 9 17:02:40 node1 fenced[44186]: cluster is down, exiting
Environment
- Red Hat Enterprise Linux Server 6 (with the High Availability or Resilient Storage Add Ons)
- Red Hat High Availability cluster with 2 or more nodes
- Fencing method is fence_scsi.
- device-mapper-multipath configured with a Hitachi DF600F model SAN
- SAN devices are "HITACHI" "DF600F" model:
# cat /proc/scsi/scsi
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: HITACHI  Model: DF600F  Rev: 0000
  Type:   Direct-Access    ANSI SCSI revision: 04
Host: scsi4 Channel: 00 Id: 00 Lun: 01
  Vendor: HITACHI  Model: DF600F  Rev: 0000
- Specifically, the directio path checker is in use (this is not the default):
# grep path_checker /etc/multipath.conf
        path_checker            directio
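For reference, the directio checker would typically be selected by a stanza like the following in /etc/multipath.conf; this is an illustrative sketch only, not this cluster's actual file (the setting could equally live in the defaults section rather than a per-device stanza):

```
devices {
    device {
        vendor        "HITACHI"
        product       "DF600F"
        path_checker  directio
    }
}
```

The directio checker tests a path by issuing a direct read I/O to the device, so it behaves differently from checkers such as tur (TEST UNIT READY) when SCSI reservations restrict access to the LUN.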