The Oracle RAC server crashed due to the abrupt closure of the character device '/dev/watchdog0'.
Environment
- Red Hat Enterprise Linux
Issue
- A couple of "kernel: watchdog: watchdog0: watchdog did not stop!" messages are logged around the reboot timestamp.
- The server restarted with the errors below.
$ less /var/log/messages
Feb 2 04:19:08 node1 kernel: watchdog: watchdog0: watchdog did not stop!
Feb 2 04:19:08 node1 kernel: watchdog: watchdog0: watchdog did not stop!
Feb 2 04:25:47 node1 kernel: Command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-553.36.1.el8_10.x86_64 root=/dev/mapper/rootVG-rootLV ro crashkernel=auto spectre_v2=retpoline rd.lvm.lv=rootVG/rootLV rd.lvm.lv=rootVG/swapLV rhgb quiet audit=1 rd.break
Resolution
- Contact Oracle Support to verify whether any process related to OSYSMOND opens the /dev/watchdog0 file.
- Refer to the article How to track process which is opening /dev/watchdog and closing incorrectly causing "watchdog did not stop!" message followed by system crash? to configure the SystemTap script that monitors the process(es) that open and close the /dev/watchdog0 file.
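As a quick first check before setting up SystemTap, the processes that currently hold the device open can be listed from /proc without opening the device itself. The following is a minimal sketch, not a replacement for the SystemTap script: the function name find_openers is hypothetical, root is needed to inspect other users' processes, and only a process that has the file open at the moment of the scan is reported, so a short-lived open and close will be missed.

```shell
#!/bin/sh
# Hypothetical helper: list PID and command name of every process that
# currently has the given path open. It reads the /proc/<pid>/fd symlinks
# and never opens the target itself, so it does not arm the watchdog.
find_openers() {
    target="${1:-/dev/watchdog0}"
    for fd in /proc/[0-9]*/fd/*; do
        # Each fd entry is a symlink to the file the process has open.
        [ "$(readlink "$fd" 2>/dev/null)" = "$target" ] || continue
        pid="${fd#/proc/}"
        pid="${pid%%/*}"
        printf '%s %s\n' "$pid" "$(cat "/proc/$pid/comm" 2>/dev/null)"
    done | sort -u
}
```

Run as root, `find_openers /dev/watchdog0` prints one "PID command" line per process holding the device; empty output means nothing had it open when the scan ran.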
Root Cause
- The analysis indicates that the system rebooted because the character device /dev/watchdog0 was closed unexpectedly.
- When a user-space utility such as cat, grep, or fdisk opens the /dev/watchdog0 file, the watchdog timer is activated. If the file is then closed without the magic close character having been written to it, the kernel logs "watchdog did not stop!" and the timer keeps running until the file is re-opened and that character is written. If this does not happen, the watchdog timer expires and the system reboots.
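The clean stop sequence described above can be sketched in shell. The Linux kernel watchdog API documents the magic close character as 'V'; the function name watchdog_magic_close is hypothetical. The sketch assumes that no other process currently holds the device (a watchdog device accepts a single opener at a time) and that the driver is not running with nowayout, in which case the timer cannot be stopped. Opening the device is itself what arms the timer, so this is an illustration of the mechanism rather than something to run on a cluster node without guidance from Oracle Support.

```shell
#!/bin/sh
# Hypothetical helper: open the watchdog device, write the magic close
# character 'V', and close it again. The single redirection performs the
# open, the write, and the close in that order.
watchdog_magic_close() {
    dev="${1:-/dev/watchdog0}"
    # printf is used instead of echo so that exactly one byte is written.
    printf 'V' > "$dev"
}
```

A utility such as cat or grep opens and closes the file without ever writing this character, which is why a stray read of /dev/watchdog0 leaves the timer running.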
Diagnostic Steps
- Journal Logs:
$ less sos_commands/logs/journalctl_--no-pager
...
Feb 02 04:19:06 node1 osysmond.bin[134935]: Oracle Clusterware: 2025-02-02 04:19:06.603 [(134935)]CRS-8500:Oracle Clusterware OSYSMOND process is starting with operating system process ID 134935
Feb 02 04:19:08 node1 kernel: watchdog: watchdog0: watchdog did not stop! <<< - - -
Feb 02 04:19:08 node1 kernel: watchdog: watchdog0: watchdog did not stop! <<< - - -
Feb 02 04:19:08 node1 kernel: NET: Registered protocol family 40
-- Reboot --
$ grep watchdog0 sos_commands/devices/udevadm_info_--export-db -A 4
P: /devices/virtual/watchdog/watchdog0
N: watchdog0
E: DEVNAME=/dev/watchdog0
E: DEVPATH=/devices/virtual/watchdog/watchdog0
E: MAJOR=248
E: MINOR=0
E: SUBSYSTEM=watchdog
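On a live system, the same device can be inspected through sysfs without opening /dev/watchdog0, which matters here because opening the device is what arms the timer. The sketch below assumes the kernel exposes the watchdog sysfs attributes identity, state, timeout, and nowayout under /sys/class/watchdog, which depends on the kernel being built with CONFIG_WATCHDOG_SYSFS. The function name show_watchdog is hypothetical, and any attribute that is missing is reported as n/a.

```shell
#!/bin/sh
# Hypothetical helper: print selected sysfs attributes of a watchdog device.
# Reading sysfs does not open the character device, so the timer is not armed.
show_watchdog() {
    dir="${1:-/sys/class/watchdog/watchdog0}"
    for attr in identity state timeout nowayout; do
        if [ -r "$dir/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dir/$attr")"
        else
            printf '%s=n/a\n' "$attr"
        fi
    done
}
```

Calling `show_watchdog` with no argument reads watchdog0. The identity value names the driver behind the device, and state shows whether the timer is currently running.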
- The journal logs indicate that the system rebooted multiple times due to the unexpected closure of the character device /dev/watchdog0.
- Before each reboot event, the Oracle Clusterware System Monitor Daemon (OSYSMOND) was started.
$ egrep "Oracle Clusterware OSYSMOND|watchdog0: watchdog did not stop" sos_commands/logs/journalctl_--no-pager -B1
Feb 01 21:29:16 node1 osysmond.bin[29127]: Oracle Clusterware: 2025-02-01 21:29:16.478 [(29127)]CRS-8500:Oracle Clusterware OSYSMOND process is starting with operating system process ID 29127
Feb 01 21:29:17 node1 kernel: watchdog: watchdog0: watchdog did not stop!
Feb 01 21:29:17 node1 kernel: watchdog: watchdog0: watchdog did not stop!
....
Feb 02 00:30:52 node1 osysmond.bin[49562]: Oracle Clusterware: 2025-02-02 00:30:52.839 [(49562)]CRS-8500:Oracle Clusterware OSYSMOND process is starting with operating system process ID 49562
Feb 02 00:30:54 node1 kernel: watchdog: watchdog0: watchdog did not stop!
Feb 02 00:30:54 node1 kernel: watchdog: watchdog0: watchdog did not stop!
....
Feb 02 04:19:06 node1 osysmond.bin[134935]: Oracle Clusterware: 2025-02-02 04:19:06.603 [(134935)]CRS-8500:Oracle Clusterware OSYSMOND process is starting with operating system process ID 134935
Feb 02 04:19:08 node1 kernel: watchdog: watchdog0: watchdog did not stop!
Feb 02 04:19:08 node1 kernel: watchdog: watchdog0: watchdog did not stop!
....
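To review the correlation between OSYSMOND startups and the watchdog message across a long journal, the two patterns can be paired with awk. This is a sketch under the assumption that the journal file holds one event per line in the syslog-style format shown in the Journal Logs step; the function name pair_events is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: for each "watchdog did not stop!" message, print its
# timestamp next to the timestamp and PID tag of the most recent OSYSMOND start.
pair_events() {
    awk '
        /Oracle Clusterware OSYSMOND process is starting/ {
            # Fields 1-3 are the syslog timestamp, field 5 is "name[pid]:".
            tag = $5
            sub(/:$/, "", tag)
            last = $1 " " $2 " " $3 " " tag
            next
        }
        /watchdog0: watchdog did not stop!/ {
            stamp = $1 " " $2 " " $3
            # The message is logged twice per event, so report each
            # distinct timestamp only once.
            if (stamp != prev) {
                print stamp " <- " (last == "" ? "no OSYSMOND start seen" : last)
                prev = stamp
            }
        }
    ' "$@"
}
```

For example, `pair_events sos_commands/logs/journalctl_--no-pager` prints one line per watchdog event, so a gap of a second or two after each OSYSMOND start is easy to spot.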
$ tail -n 3 sos_commands/ipmitool/ipmitool_sel_elist
67 | 02/02/2025 | 00:32:25 | Watchdog2 OS Watchdog Time | Hard reset (Hard Reset|Interrupt type None,SMS/OS Timer used at expiration) | Asserted
68 | 02/02/2025 | 00:47:16 | Watchdog2 OS Watchdog Time | Hard reset (Hard Reset|Interrupt type None,SMS/OS Timer used at expiration) | Asserted
69 | 02/02/2025 | 04:20:40 | Watchdog2 OS Watchdog Time | Hard reset (Hard Reset|Interrupt type None,SMS/OS Timer used at expiration) | Asserted