All nodes in Oracle RAC cluster reboot when SAN controller fails on Red Hat Enterprise Linux 5.10
Issue
- A SAN event caused all eight nodes in a RHEL 5.10 Oracle 10g RAC cluster to reboot simultaneously. The event was a director board failure, which caused one of the two paths to each LUN to fail.
- The kernel logs from right before the reboot showed many SCSI errors and timeouts, and ended with "SysRq : Resetting" before the systems started booting.
- Why didn't the multipath daemon take advantage of the working paths (or rather, disable the failed paths)?
Environment
- Red Hat Enterprise Linux 5.10 with kernel-2.6.18-371.6.1.el5 or later.
- Oracle RAC 10g environment with two or more nodes
- Oracle RAC's voting disk timeout is 200 seconds
- Device-mapper-multipath
- SAN environment suffering controller failure (for example, director board failure)
- Similar symptoms may be seen if a failure occurs on the fabric and RSCNs (Registered State Change Notifications) are not sent to alert targets of the change.
- Raw devices used for some Oracle devices
