About the Kernel Watchdog
Watchdog
The kernel provides a watchdog that resets the system once a timeout expires. The idea is that if the system hangs and can no longer make progress, it is reset automatically instead of staying stuck.
Opening the /dev/watchdog0 device file "arms" the watchdog. Arming starts a countdown, and the system is reset when the timeout is reached unless the countdown is restarted first. If the device file is closed without the watchdog being disarmed, the kernel emits the message watchdog: watchdog0: watchdog did not stop! which can be found in the system logs (for example through journalctl or rsyslogd).
The watchdog is "disarmed" when the "magic character" ('V') is written to /dev/watchdog0 and the file is then closed. To prevent an armed watchdog from resetting the system, one simply needs to write to /dev/watchdog0, or even just open it, before the timeout expires. Doing either resets the timeout.
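The arm, keep-alive, and disarm sequence described above can be sketched in a few lines of Python. This is a minimal sketch, not a production daemon: the function name pet_then_disarm and its parameters are made up for illustration, and it assumes the driver honours the magic character and that nowayout is disabled.

```python
import time


def pet_then_disarm(device_path="/dev/watchdog0", pings=3, interval=1.0):
    """Arm the watchdog, keep it alive for a few intervals, then disarm it.

    The function name and all parameters are illustrative assumptions.
    """
    # Opening the device file arms the watchdog. buffering=0 makes each
    # write() reach the driver immediately instead of sitting in a buffer.
    with open(device_path, "wb", buffering=0) as wd:
        for _ in range(pings):
            wd.write(b"\0")      # any write resets the timeout
            time.sleep(interval)
        # Magic character: written last, right before the file is closed.
        wd.write(b"V")
    # Leaving the with-block closes the file. With nowayout disabled this
    # is expected to disarm the watchdog.
```

Run this as root and only on a machine that may be reset. The interval must stay well below the watchdog timeout, otherwise the system resets between two writes.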
Modules
A number of watchdog kernel modules exist to provide watchdog functionality. The most common are Intel's iTCO_wdt and HP's hpwdt.
Kernel API
The kernel provides an API for interacting with the watchdog, typically through ioctl calls on the device file, where a macro (the request code) indicates which action to take. See https://www.kernel.org/doc/html/latest/watchdog/watchdog-api.html
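As a rough illustration of the ioctl interface, the sketch below rebuilds a few request codes from linux/watchdog.h in Python and uses them to query the timeout. The bit layout in _ioc is the one used on x86 and Arm, which is an assumption here because a few architectures encode the direction bits differently. The helper names _ioc, get_int, and query_and_disarm are hypothetical.

```python
import fcntl
import struct

# Direction bits of the generic ioctl encoding (x86, Arm). Assumption:
# some architectures (for example PowerPC and MIPS) use other values.
_IOC_WRITE, _IOC_READ = 1, 2


def _ioc(direction, nr, size=struct.calcsize("i")):
    # Layout: direction (bits 30-31), size (16-29), type 'W' (8-15), nr (0-7)
    return (direction << 30) | (size << 16) | (ord("W") << 8) | nr


# Request codes corresponding to the macros in <linux/watchdog.h>
WDIOC_KEEPALIVE = _ioc(_IOC_READ, 5)
WDIOC_SETTIMEOUT = _ioc(_IOC_READ | _IOC_WRITE, 6)
WDIOC_GETTIMEOUT = _ioc(_IOC_READ, 7)
WDIOC_GETTIMELEFT = _ioc(_IOC_READ, 10)


def get_int(fd, request):
    """Run a 'get' ioctl that fills in a single C int and return it."""
    buf = fcntl.ioctl(fd, request, struct.pack("i", 0))
    return struct.unpack("i", buf)[0]


def query_and_disarm(device_path="/dev/watchdog0"):
    """Hypothetical helper. Note that opening the device arms the watchdog."""
    with open(device_path, "wb", buffering=0) as wd:
        fcntl.ioctl(wd, WDIOC_KEEPALIVE)          # reset the timeout
        timeout = get_int(wd, WDIOC_GETTIMEOUT)
        # Not every driver implements this; it may raise OSError.
        timeleft = get_int(wd, WDIOC_GETTIMELEFT)
        wd.write(b"V")                            # magic close, see above
    return timeout, timeleft
```

Calling query_and_disarm() as root should return the configured timeout and the remaining seconds. Because the call opens the device, the same caution applies as for any other program that arms the watchdog.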
sysfs interface
/sys/class/watchdog/ contains a listing of all available watchdog devices on the system (as you can have multiple):
# ls /sys/class/watchdog/
watchdog0
Within these directories, you will have the following interfaces:
# ls /sys/class/watchdog/watchdog0/
bootstatus  dev  device  identity  max_timeout  min_timeout  nowayout  power  state  status  subsystem  timeleft  timeout  uevent
Notable interfaces are:
- nowayout: If enabled (1), the watchdog cannot be disarmed. The timer can be reset as usual (by opening the file or writing to it), but the watchdog will eventually reset the system if the timeout is never reset. Defaults to 0.
- timeleft: The time in seconds left before the timeout is reached and the system is reset. Has the same value as timeout while the watchdog is disarmed. Re-arming the watchdog (by opening or writing to the watchdog file) also resets this value.
- timeout: The time in seconds between arming the watchdog and system reset.
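Because these interfaces are plain text files, they can be read without opening the watchdog device itself. The following is a small sketch. The function name read_watchdog_attrs and its default arguments are illustrative, and attributes a driver does not provide are reported as None.

```python
from pathlib import Path


def read_watchdog_attrs(
    sysfs_dir="/sys/class/watchdog/watchdog0",
    names=("identity", "state", "nowayout", "timeout", "timeleft"),
):
    """Return the requested sysfs attributes as a dict of strings."""
    attrs = {}
    for name in names:
        try:
            # Each attribute is a one-line text file ending in a newline.
            attrs[name] = (Path(sysfs_dir) / name).read_text().strip()
        except OSError:
            # Attribute missing or unreadable on this system.
            attrs[name] = None
    return attrs
```

For example, read_watchdog_attrs()["timeout"] would be the string "60" on a system configured like the one shown in the wdctl output in the Tooling section.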
Tooling
wdctl
The wdctl command, provided by the util-linux package, displays information about the watchdog:
[root@host ~]# wdctl
Device:        /dev/watchdog0
Identity:      iTCO_wdt [version 0]
Timeout:       60 seconds
Pre-timeout:    0 seconds
Timeleft:      59 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0
The above output shows:
- The device file used for interacting with the watchdog
- The kernel module providing the watchdog functionality
- The timeout before system reset
- The pre-timeout: if supported, the watchdog can trigger a notification (for example, an interrupt) this many seconds before the system reset
- The amount of time left in the timeout countdown
- The series of flags of additional features the watchdog module/device supports
Note: wdctl disarms the watchdog when run if nowayout is disabled. If your workload depends on the watchdog staying armed while nowayout is disabled, do not run wdctl in production.
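One way to guard against this is to check the sysfs state before invoking wdctl. The sketch below is only an illustration of that check: the function name wdctl_would_disarm is made up, and it assumes the state file reads "active" while the watchdog is armed and that nowayout reads "1" when enabled.

```python
from pathlib import Path


def wdctl_would_disarm(sysfs_dir="/sys/class/watchdog/watchdog0"):
    """Return True if running wdctl risks disarming an armed watchdog."""
    base = Path(sysfs_dir)
    # Assumption: "active" means armed, anything else means not armed.
    armed = (base / "state").read_text().strip() == "active"
    nowayout = (base / "nowayout").read_text().strip() == "1"
    # Only an armed watchdog without nowayout can be disarmed.
    return armed and not nowayout
```

A wrapper script could call this first and skip wdctl whenever it returns True.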
watchdog.service
Provided by the watchdog package, this service can be configured to arm and disarm the kernel watchdog. The default configuration does nothing; /etc/watchdog.conf must be modified before the service will arm and disarm the watchdogs in the system. See man watchdog.conf for more information.
Additional Reading:
https://www.kernel.org/doc/html/latest/watchdog/index.html
https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.txt