SCSI reservation fencing hangs with Hitachi SAN and device-mapper-multipath on Red Hat Enterprise Linux Clusters
Issue
- Why did fence_scsi not complete on node2 when racing with node1, allowing the nodes to try to form a new cluster and kill each other? Log example:
- Node1:
    Apr 9 17:01:52 node1 corosync[44112]: [TOTEM ] A processor failed, forming new configuration.
    Apr 9 17:01:54 node1 corosync[44112]: [QUORUM] Members[1]: 1
    Apr 9 17:01:54 node1 corosync[44112]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Apr 9 17:01:54 node1 corosync[44112]: [CPG ] chosen downlist: sender r(0) ip(10.10.10.10) r(1) ip(100.254.180.51) ; members(old:2 left:1)
    Apr 9 17:01:54 node1 rgmanager[44854]: State change: node2 DOWN
    Apr 9 17:01:54 node1 corosync[44112]: [MAIN ] Completed service synchronization, ready to provide service.
    Apr 9 17:01:54 node1 fenced[44186]: fencing node node2
    Apr 9 17:01:55 node1 fenced[44186]: fence node2 success
- Node2 tries to fence node1 but loses the race; instead it hits SCSI reservation errors and its fencing never completes:
    Apr 9 17:01:52 node2 corosync[34337]: [TOTEM ] A processor failed, forming new configuration.
    Apr 9 17:01:54 node2 corosync[34337]: [QUORUM] Members[1]: 2
    Apr 9 17:01:54 node2 corosync[34337]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Apr 9 17:01:54 node2 corosync[34337]: [CPG ] chosen downlist: sender r(0) ip(10.10.10.11) r(1) ip(100.254.180.52) ; members(old:2 left:1)
    Apr 9 17:01:54 node2 rgmanager[35045]: State change: node1 DOWN
    Apr 9 17:01:54 node2 corosync[34337]: [MAIN ] Completed service synchronization, ready to provide service.
    Apr 9 17:01:54 node2 kernel: dlm: closing connection to node 1
    Apr 9 17:01:54 node2 fenced[34398]: fencing node node1
    ...
    Apr 9 17:02:02 node2 kernel: sd 1:0:0:16: reservation conflict
    Apr 9 17:02:12 node2 kernel: sd 4:0:0:17: reservation conflict
    Apr 9 17:02:12 node2 kernel: sd 1:0:0:18: reservation conflict
    Apr 9 17:02:12 node2 kernel: sd 4:0:0:19: reservation conflict
    Apr 9 17:02:12 node2 kernel: sd 1:0:0:20: reservation conflict
    Apr 9 17:02:13 node2 multipathd: 66:48: mark as failed
    Apr 9 17:02:13 node2 multipathd: exts1: remaining active paths: 1
    ...
    Apr 9 17:01:55 node2 corosync[34337]: [TOTEM ] Automatically recovered ring 1
    Apr 9 17:02:21 node2 corosync[34337]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Apr 9 17:02:21 node2 corosync[34337]: [CPG ] chosen downlist: sender r(0) ip(10.10.10.11) r(1) ip(100.254.180.52) ; members(old:1 left:0)
    Apr 9 17:02:21 node2 corosync[34337]: [MAIN ] Completed service synchronization, ready to provide service.
- Node1 tries to kill node2 for joining with existing state, and ends up killing itself:
    Apr 9 17:02:40 node1 corosync[44112]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Apr 9 17:02:40 node1 corosync[44112]: [QUORUM] Members[2]: 1 2
    Apr 9 17:02:40 node1 corosync[44112]: [QUORUM] Members[2]: 1 2
    Apr 9 17:02:40 node1 corosync[44112]: [CPG ] chosen downlist: sender r(0) ip(10.10.10.10) r(1) ip(100.254.180.51) ; members(old:1 left:0)
    Apr 9 17:02:40 node1 corosync[44112]: [MAIN ] Completed service synchronization, ready to provide service.
    Apr 9 17:02:40 node1 fenced[44186]: telling cman to remove nodeid 2 from cluster
    Apr 9 17:02:40 node1 corosync[44112]: cman killed by node 2 because we were killed by cman_tool or other application
    Apr 9 17:02:40 node1 fenced[44186]: cluster is down, exiting
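For context on the "reservation conflict" messages in node2's log: fence_scsi fences through SCSI-3 persistent reservations, where each node registers its own key on the shared devices and fencing preempts the victim's key, after which the storage rejects writes from the fenced node. As a rough way to inspect this state from either node (not taken from this article's logs), the registered keys and the reservation holder can be read with sg_persist from the sg3_utils package; /dev/mapper/mpathX below is a placeholder for one of the clustered LUNs:

    # sg_persist --in --read-keys --device=/dev/mapper/mpathX
    # sg_persist --in --read-reservation --device=/dev/mapper/mpathX

If node2's key is absent from the read-keys output while node1's remains, node1 won the fencing race, and I/O from node2 that the reservation blocks will fail with the reservation conflicts seen in its log.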
Environment
- Red Hat Enterprise Linux Server 6 (with the High Availability or Resilient Storage Add-On)
- Red Hat High Availability cluster with 2 or more nodes:
  - Fencing method is fence_scsi (see the cluster.conf sketch after this list).
- Device-mapper-multipath configured with a Hitachi DF600F model SAN:
  - SAN devices are "HITACHI" "DF600F" model:

        # cat /proc/scsi/scsi
        Host: scsi1 Channel: 00 Id: 00 Lun: 00
          Vendor: HITACHI  Model: DF600F  Rev: 0000
          Type:   Direct-Access           ANSI SCSI revision: 04
        Host: scsi4 Channel: 00 Id: 00 Lun: 01
          Vendor: HITACHI  Model: DF600F  Rev: 0000

  - Specifically, the directio path checker is in use (this is not the default; see the multipath.conf sketch after this list):
        # grep path_checker /etc/multipath.conf
        path_checker directio
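For reference, fence_scsi is configured in /etc/cluster/cluster.conf like any other fence agent; a minimal sketch for a two-node setup follows (the device name "scsi-fence" is a placeholder, and this is not this cluster's actual configuration):

    <clusternodes>
      <clusternode name="node1" nodeid="1">
        <fence>
          <method name="1">
            <device name="scsi-fence"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="node2" nodeid="2">
        <fence>
          <method name="1">
            <device name="scsi-fence"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>
    <fencedevices>
      <fencedevice agent="fence_scsi" name="scsi-fence"/>
    </fencedevices>

Real deployments also typically define an unfence section per node so that each node re-registers its reservation key when it rejoins the cluster.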
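Similarly, path_checker can be set globally in the defaults section of /etc/multipath.conf (which the grep above suggests) or per-array in a devices stanza; a sketch of the per-array form using this environment's vendor and product strings:

    devices {
        device {
            vendor       "HITACHI"
            product      "DF600F"
            path_checker directio
        }
    }

The merged configuration that multipathd actually applies to these devices, including the effective checker, can be verified at runtime with multipathd -k"show config".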
