About the Kernel Watchdog

Watchdog

The kernel provides a watchdog that resets the system once a timeout expires. The idea is that if the system hangs and can no longer make progress, it is reset automatically instead of staying stuck.

Opening the /dev/watchdog0 file "arms" the watchdog. Arming starts a countdown, and the system is reset if the countdown reaches zero. If the file is closed without the watchdog being disarmed first, the kernel logs the message watchdog: watchdog0: watchdog did not stop! and the countdown keeps running. The message can be found in the system logs, for example with journalctl or in the files written by rsyslogd.

The watchdog is "disarmed" when the "magic character" ('V') is written to /dev/watchdog0 just before the file is closed. To keep an armed watchdog from resetting the system, a process only needs to write to /dev/watchdog0, or even just open it, before the timeout expires. Doing either resets the countdown.
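
The arm, keepalive, and disarm sequence described above could be sketched in Python roughly as follows. This is a minimal sketch rather than a definitive implementation: the function name pet_then_disarm and its parameters are hypothetical, and it assumes the driver supports the magic character and that nowayout is disabled. Running it against a real /dev/watchdog0 arms the watchdog, so try it on a test machine first.

```python
import os
import time


def pet_then_disarm(device="/dev/watchdog0", pings=3, interval=1.0):
    """Arm the watchdog, reset its countdown a few times, then disarm it.

    Hypothetical helper for illustration; needs root on a real device.
    """
    # Opening the device file arms the watchdog.
    fd = os.open(device, os.O_WRONLY)
    try:
        for _ in range(pings):
            # Any write resets the countdown. The byte written here is
            # arbitrary as long as it is not the magic character.
            os.write(fd, b"\0")
            time.sleep(interval)
        # Writing the magic character 'V' right before closing asks the
        # driver to disarm the watchdog (assumes nowayout is 0).
        os.write(fd, b"V")
    finally:
        # If an error skipped the 'V' write, the watchdog stays armed.
        os.close(fd)
```

The interval should stay well below the configured timeout, otherwise the system resets between two writes.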

Modules

A number of watchdog kernel modules exist to provide watchdog functionality. The most common are Intel's iTCO_wdt and HP's hpwdt.

Kernel API

The kernel provides an API for interacting with the watchdog, typically through ioctl calls in which a request macro (for example WDIOC_KEEPALIVE) selects the action to take. See https://www.kernel.org/doc/html/latest/watchdog/watchdog-api.html
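
As an illustration of the ioctl interface, the sketch below queries the configured timeout with WDIOC_GETTIMEOUT from Python. The request numbers are rebuilt by hand to mirror the macros in <linux/watchdog.h>, under the assumption that the system uses the common ioctl encoding found on x86 and arm (some other architectures encode the direction bits differently). The helper names _ioc and get_timeout are hypothetical. Note that opening the device arms the watchdog, so the sketch writes the magic character before closing, which only disarms it when nowayout is disabled.

```python
import fcntl
import os
import struct

# ioctl request encoding used by most Linux architectures (x86, arm).
# Assumption: some architectures (e.g. powerpc, mips) use different
# direction bits, so these values are not universal.
_IOC_WRITE = 1
_IOC_READ = 2


def _ioc(direction, type_char, number, size):
    """Build an ioctl request number the way the kernel's _IOC macro does."""
    return (direction << 30) | (size << 16) | (ord(type_char) << 8) | number


INT_SIZE = struct.calcsize("i")

# Mirrors of the macros in <linux/watchdog.h>.
WDIOC_KEEPALIVE = _ioc(_IOC_READ, "W", 5, INT_SIZE)
WDIOC_SETTIMEOUT = _ioc(_IOC_READ | _IOC_WRITE, "W", 6, INT_SIZE)
WDIOC_GETTIMEOUT = _ioc(_IOC_READ, "W", 7, INT_SIZE)


def get_timeout(device="/dev/watchdog0"):
    """Return the configured timeout in seconds (hypothetical helper)."""
    # Opening the device arms the watchdog as a side effect.
    fd = os.open(device, os.O_WRONLY)
    try:
        # The kernel fills the integer buffer with the current timeout.
        raw = fcntl.ioctl(fd, WDIOC_GETTIMEOUT, struct.pack("i", 0))
        timeout = struct.unpack("i", raw)[0]
        # Magic close: disarm again before closing (assumes nowayout is 0).
        os.write(fd, b"V")
        return timeout
    finally:
        os.close(fd)
```

Calling get_timeout() as root on a test machine should return the same number of seconds that wdctl reports as Timeout.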

sysfs interface

/sys/class/watchdog/ contains a listing of all available watchdog devices on the system (as you can have multiple):

# ls /sys/class/watchdog/
watchdog0

Within these directories, you will have the following interfaces:

# ls /sys/class/watchdog/watchdog0/
bootstatus  dev  device  identity  max_timeout  min_timeout  nowayout  power  state  status  subsystem  timeleft  timeout  uevent

Notable interfaces are:

  • nowayout If enabled (1), the watchdog cannot be disarmed. The countdown can still be reset as usual (by opening the file or writing to it), but the system is reset as soon as the countdown is allowed to expire. Defaults to 0.
  • timeleft Contains the number of seconds left before the timeout is reached and the system is reset. Has the same value as timeout while the watchdog is disarmed. Resetting the countdown (by opening or writing to the watchdog file) also resets this value.
  • timeout The number of seconds between arming the watchdog (or the last countdown reset) and system reset.
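
The attributes listed above are plain text files, so they can be read without touching the device file. The following is a minimal sketch, assuming the sysfs layout shown earlier; the function name read_watchdog_attrs is hypothetical, and it assumes that a driver may not expose every attribute. Reading sysfs does not open /dev/watchdog0, so it should not arm or disarm the watchdog.

```python
from pathlib import Path


def read_watchdog_attrs(sysfs_dir="/sys/class/watchdog/watchdog0",
                        names=("nowayout", "timeout", "timeleft")):
    """Return {attribute: integer value} for the attributes that exist."""
    attrs = {}
    for name in names:
        path = Path(sysfs_dir) / name
        # Assumption: not every driver exposes every attribute,
        # so missing files are skipped instead of raising an error.
        if path.is_file():
            attrs[name] = int(path.read_text().strip())
    return attrs


if __name__ == "__main__":
    # Print the attributes of every watchdog device on the system.
    for device in sorted(Path("/sys/class/watchdog").glob("watchdog*")):
        print(device.name, read_watchdog_attrs(device))
```

Checking nowayout this way before using a tool that opens the device file shows whether that tool would be able to disarm the watchdog.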

Tooling

wdctl

The wdctl command, provided by the util-linux package, shows information about the watchdog:

[root@host ~]# wdctl
Device:        /dev/watchdog0
Identity:      iTCO_wdt [version 0]
Timeout:       60 seconds
Pre-timeout:    0 seconds
Timeleft:      59 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0

The above output shows:

  • The device file used for interacting with the watchdog
  • The kernel module providing the watchdog functionality
  • The timeout before system reset
  • The pre-timeout: if supported, the watchdog can trigger an action this many seconds before system reset
  • The amount of time left in the timeout countdown
  • The series of flags of additional features the watchdog module/device supports

Note: wdctl disarms the watchdog when run if nowayout is disabled. If your workload depends on the watchdog staying armed while nowayout is disabled, do not run wdctl in production.

watchdog.service

Provided by the watchdog package, this service can be configured to arm and disarm the kernel watchdog. The default configuration does nothing; /etc/watchdog.conf must be modified before the service will arm and disarm the watchdogs in the system. See man watchdog.conf for more information.

Additional Reading:

https://www.kernel.org/doc/html/latest/watchdog/index.html

https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.txt
