All nodes in Oracle RAC cluster reboot when SAN controller fails on Red Hat Enterprise Linux 5.10

Solution Unverified - Updated 2024-08-05T06:52:23+00:00

Issue

A SAN event caused all eight nodes in a RHEL 5.10 Oracle 10g cluster to reboot simultaneously. The event was a director board failure, which caused one of the two paths to each LUN to fail.
- The kernel logs from just before the reboot showed many SCSI errors and timeouts and ended with "SysRq : Resetting" before the systems started booting.
- Why didn't the multipath daemon take advantage of the working paths (or rather, disable the failed paths)?

Environment

Red Hat Enterprise Linux 5.10 with kernel-2.6.18-371.6.1.el5 or later
Oracle RAC 10g environment with two or more nodes
- Oracle RAC's voting disk timeout is 200 seconds
Device-mapper-multipath
SAN environment suffering controller failure (for example, director board failure)
- Similar symptoms may be seen if a failure occurs on the fabric and Registered State Change Notifications (RSCNs) are not sent to alert targets of the change.
Raw devices used for some Oracle devices
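
When checking whether a node matches this environment, it can help to compare the per-command SCSI timeout of each path device against the 200 second voting disk timeout listed above. The following is a minimal Python 3 sketch, not a diagnosis of this issue. It reads the standard `timeout` attribute under `/sys/block/sd*/device/`. The function names and the `attempts` multiplier are hypothetical and for illustration only: the real time needed to fail a path depends on the SCSI error handler, the HBA driver, and the multipath configuration, none of which this sketch models.

```python
import glob
import os

# Voting disk timeout taken from the environment described above.
VOTING_DISK_TIMEOUT_S = 200


def read_scsi_timeouts(sysfs_root="/sys/block"):
    """Return {device: per-command SCSI timeout in seconds} for sd* devices."""
    timeouts = {}
    pattern = os.path.join(sysfs_root, "sd*", "device", "timeout")
    for path in glob.glob(pattern):
        # Path layout is <root>/<device>/device/timeout.
        device = path.split(os.sep)[-3]
        with open(path) as handle:
            timeouts[device] = int(handle.read().strip())
    return timeouts


def worst_case_seconds(timeout_s, attempts):
    """Rough upper bound, ASSUMING each attempt runs to the full timeout."""
    return timeout_s * attempts


def devices_over_budget(timeouts, attempts, budget_s=VOTING_DISK_TIMEOUT_S):
    """List devices whose assumed worst case meets or exceeds the budget."""
    return sorted(
        device
        for device, timeout_s in timeouts.items()
        if worst_case_seconds(timeout_s, attempts) >= budget_s
    )
```

Calling `devices_over_budget(read_scsi_timeouts(), attempts=N)` on a node, with `N` chosen by the administrator as an assumed number of attempts before a path is failed, returns the `sd` devices whose estimated worst case reaches the voting disk timeout. An empty list does not rule out the problem, because the estimate is deliberately simplistic.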
