What does the `kernel: WARNING: at net/sched/sch_generic.c dev_watchdog()` error indicate?

Environment

  • Red Hat Enterprise Linux 9
  • Red Hat Enterprise Linux 8
  • Red Hat Enterprise Linux 7
  • Red Hat Enterprise Linux 6
  • Red Hat Enterprise Linux 5
  • Network interfaces using the following drivers, and possibly others:
    bnx2 bnx2x e1000 e1000e igb ixgbe netxen_nic mlx4_core niu r8169 sky2 tg3 enic ena

Issue

  • The system loses network connectivity, and network driver backtraces similar to the following appear in /var/log/messages:
WARNING: at net/sched/sch_generic.c:... dev_watchdog+0x.../0x...() (Not tainted)
Hardware name: ...
NETDEV WATCHDOG: ethX (<drivername>): transmit queue N timed out
Modules linked in: ...
Pid: ..., comm: ... Not tainted 2.6.32-....el6.x86_64 #1
Call Trace:
<IRQ>  [<ffffffff........>] ? warn_slowpath_common+...
[<ffffffff........>] ? warn_slowpath_fmt+...
[<ffffffff........>] ? dev_watchdog+...
[<ffffffff........>] ? run_timer_softirq+...
...
[<ffffffff........>] ? __do_softirq+...
...
[<ffffffff........>] ? call_softirq+...
[<ffffffff........>] ? do_softirq+...
[<ffffffff........>] ? irq_exit+...
...
<EOI>  [<ffffffff........>] ? ...
...

Resolution

The NETDEV WATCHDOG message is the kernel's way of saying "This network device has not been transmitting data for a few seconds, even though it has data to transmit."

The watchdog message does not indicate why the device stopped transmitting. It may be due to a hardware error or a software (kernel/driver/BIOS/firmware) bug.
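To see which interfaces, drivers, and transmit queues are affected, and how often, the watchdog lines can be pulled out of the log. The following is a minimal sketch, assuming the message format shown in the Issue section; the function name `count_watchdog_events` is illustrative, and kernels that word the message differently would not match this pattern.

```python
import re
import sys
from collections import Counter

# Matches the message format shown in the Issue section:
# "NETDEV WATCHDOG: ethX (<drivername>): transmit queue N timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def count_watchdog_events(lines):
    """Count watchdog timeouts per (interface, driver, queue)."""
    counts = Counter()
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m:
            counts[(m["iface"], m["driver"], int(m["queue"]))] += 1
    return counts


# Usage (hypothetical script name): python watchdog_count.py /var/log/messages
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as log:
        events = count_watchdog_events(log)
    for (iface, driver, queue), n in sorted(events.items()):
        print(f"{iface} ({driver}) queue {queue}: {n} timeout(s)")
```

The counts only show where and how often the timeout was reported; they do not explain why the device stopped transmitting.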

Red Hat Engineering has private bugs open for each individual driver where this issue has been seen.

Because the NETDEV WATCHDOG hang is a symptom of an issue rather than the issue itself, the root cause of each hang must be investigated on an individual basis.

Please open a case with Red Hat Global Support Services, supplying a full sosreport, and as much of the following information as possible:

  • Full dmesg (not just an excerpt with the NETDEV WATCHDOG message and the stack trace, for the reasons explained above).
  • Information about the affected hardware (sosreport should be fine).
  • Did the network interface recover automatically shortly afterwards? Or can connectivity be restored by doing ifdown followed by ifup? Or can connectivity be restored by rmmod followed by modprobe of the driver? Or is reboot the only way to make the device work again?
  • How often does the issue occur?
  • Does the occurrence of the issue seem to correlate with specific workloads? Is there a way to reproduce it, or at least to make it more likely to happen?
  • Do any of the following kernel boot parameters help?
pcie_aspm=off  (ASPM has been known to cause problems in the past.)
intremap=off   (Interrupt remapping has been known to cause lost
                interrupts in conjunction with irqbalance
                on some platforms, e.g. bug 887006)
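When testing the boot parameters listed above, it helps to confirm after the reboot that the running kernel was actually started with them. The following is a small sketch, assuming the kernel command line is read from /proc/cmdline; the helper name `active_suggested_params` is illustrative.

```python
import sys

# The two parameters suggested in the list above.
SUGGESTED = ("pcie_aspm=off", "intremap=off")


def active_suggested_params(cmdline, suggested=SUGGESTED):
    """Return the suggested parameters that appear in a kernel command line."""
    tokens = cmdline.split()
    return [param for param in suggested if param in tokens]


# Usage (hypothetical script name): python check_cmdline.py /proc/cmdline
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        found = active_suggested_params(f.read())
    print("active:", ", ".join(found) if found else "none")
```

Recording which parameters were active when the issue did or did not recur makes the answer to this question easier to report in a support case.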

Some additional troubleshooting steps which may help prevent the issue from re-occurring:

  • Update the kernel package to the latest version, which will supply the latest available driver with fixes for known issues.
  • Update the system BIOS and network interface firmware.
  • Ensure irqbalance is running.
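
Before and after updating, it can be useful to record which driver and firmware versions an interface is currently using. The following sketch assumes the `ethtool` utility is installed and parses the "key: value" lines printed by `ethtool -i`; the function names are illustrative, and not every driver reports every field.

```python
import subprocess
import sys


def parse_ethtool_info(text):
    """Parse the 'key: value' lines printed by `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as PCI addresses survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def driver_info(iface):
    """Run `ethtool -i <iface>` and return its fields (requires ethtool)."""
    result = subprocess.run(
        ["ethtool", "-i", iface], capture_output=True, text=True, check=True
    )
    return parse_ethtool_info(result.stdout)


# Usage (hypothetical script name): python nic_versions.py <interface>
if __name__ == "__main__" and len(sys.argv) > 1:
    info = driver_info(sys.argv[1])
    for field in ("driver", "version", "firmware-version"):
        print(f"{field}: {info.get(field, 'unknown')}")
```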
