All nodes in Oracle RAC cluster reboot when SAN controller fails on Red Hat Enterprise Linux 5.10
Issue
- A SAN event caused all eight nodes in a RHEL 5.10 Oracle 10g cluster to reboot simultaneously. A director board failure took down one of the two paths to each LUN.
- The kernel logs from just before the reboot showed many SCSI errors and timeouts and ended with "SysRq : Resetting" before the systems started booting.
- Why didn't the multipath daemon use the working paths (or, rather, disable the failed paths)?
Environment
- Red Hat Enterprise Linux 5.10 with kernel-2.6.18-371.6.1.el5 or later.
- Oracle RAC 10g environment with two or more nodes
- Oracle RAC's voting disk timeout is 200 seconds
- Device-mapper-multipath
- SAN environment suffering controller failure (for example, director board failure)
- Similar symptoms may be seen if a failure occurs on the fabric and RSCNs (Registered State Change Notifications) are not sent to alert targets of the change.
- Raw devices used for some Oracle devices
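When comparing a system against the environment above, it can help to record the SCSI command timeout of each path device so it can be set against the 200-second voting disk timeout. The following is a minimal diagnostic sketch, not a step taken from this article: the function name `print_scsi_timeouts` is hypothetical, and it assumes the per-device timeout is exposed at `/sys/block/sdX/device/timeout`.

```shell
#!/bin/sh
# Print the SCSI command timeout (in seconds) for each sd* block device.
# The sysfs root is a parameter so the function can be pointed at a copy
# of the tree; /sys/block is assumed as the default location.
print_scsi_timeouts() {
    root="${1:-/sys/block}"
    for dev in "$root"/sd*; do
        f="$dev/device/timeout"
        # Skip entries that have no readable timeout attribute.
        [ -r "$f" ] || continue
        printf '%s %s\n' "$(basename "$dev")" "$(cat "$f")"
    done
}
```

Calling `print_scsi_timeouts` with no argument reads the live system and prints one `device seconds` pair per line. The current state of each path can be listed separately with `multipath -ll`.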