VMware guest eth0 stops responding after logically offlining CPUs

Solution Unverified - Updated -

Issue

  • Within 10 seconds of offlining a number of virtual CPUs, eth0 begins to report NETDEV WATCHDOG transmit timeout errors. In the observed instances, an attempt was made to offline 20 virtual CPUs by running the following command in a for loop, as fast as possible:
echo 0 > /sys/devices/system/cpu/cpuX/online
  • The following messages are seen in the kernel ring buffer:
CPU 7 is now offline
CPU 7 offline: Remove Rx thread
Breaking affinity for irq 15
CPU 6 is now offline
CPU 6 offline: Remove Rx thread
CPU 5 is now offline
CPU 5 offline: Remove Rx thread
CPU 4 is now offline
CPU 4 offline: Remove Rx thread
NETDEV WATCHDOG: eth0: transmit timed out
eth0: tx hang
eth0: resetting
NETDEV WATCHDOG: eth0: transmit timed out
eth0: tx hang
NETDEV WATCHDOG: eth0: transmit timed out
eth0: tx hang
..
NMI received for unknown reason 20
WARNING: at kernel/softirq.c:138 local_bh_enable()

Call Trace:
 <NMI>  [<ffffffff8002c131>] local_bh_enable+0x44/0x99
 [<ffffffff80058239>] __ip_route_output_key+0xaf/0x816
 [<ffffffff8025634c>] ip_route_output_flow+0x18/0x1ee
 [<ffffffff8025abbf>] ip_send_reply+0x11a/0x25b
 [<ffffffff80266a14>] tcp_v4_send_reset+0x104/0x141
 [<ffffffff80037d93>] ip_route_input+0xb2f/0xcbe
 [<ffffffff80027978>] tcp_v4_rcv+0xa2c/0xa71
 [<ffffffff80034c37>] ip_local_deliver+0x19f/0x265
 [<ffffffff80035dc9>] ip_rcv+0x539/0x57c
 [<ffffffff8002108c>] netif_receive_skb+0x48c/0x4ae
 [<ffffffff881fc15d>] :vmxnet3:vmxnet3_rq_rx_complete+0xa0f/0xb7f
 [<ffffffff881fc33a>] :vmxnet3:vmxnet3_do_poll+0x6d/0x88
 [<ffffffff881fde52>] :vmxnet3:vmxnet3_netpoll+0x27/0x30
 [<ffffffff80246cfd>] netpoll_poll_dev+0x47/0x36c
 [<ffffffff802470fc>] netpoll_send_skb_on_dev+0xda/0xef
 [<ffffffff886ee0e1>] :netconsole:write_msg+0x49/0x60
 [<ffffffff800941f6>] __call_console_drivers+0x5b/0x69
 [<ffffffff80017329>] release_console_sem+0x13e/0x200
 [<ffffffff800949eb>] vprintk+0x2b2/0x317
 [<ffffffff80094aa2>] printk+0x52/0xbd
 [<ffffffff8006503a>] oops_begin+0x5e/0x65
 [<ffffffff80065263>] die_nmi+0x24/0xa3
 [<ffffffff80079832>] do_nmi_callback2+0x33/0x3c
 [<ffffffff8006561c>] default_do_nmi+0x94/0x22f
 [<ffffffff80065903>] do_nmi+0x43/0x61
 [<ffffffff80064ecf>] nmi+0x7f/0x88
 [<ffffffff8006bdfb>] default_idle+0x0/0x50
 [<ffffffff8006be24>] default_idle+0x29/0x50
 <<EOE>>  [<ffffffff80049614>] cpu_idle+0x95/0xb8
 [<ffffffff80471809>] start_kernel+0x220/0x225
 [<ffffffff8047122f>] _sinittext+0x22f/0x236

WARNING: at net/core/skbuff.c:417 skb_release_head_state()

Call Trace:
 <NMI>  [<ffffffff80236997>] skb_release_head_state+0xa2/0xf8
 [<ffffffff800291d4>] __kfree_skb+0x9/0x1a
 [<ffffffff80240653>] __neigh_event_send+0x115/0x167
 [<ffffffff80057157>] neigh_resolve_output+0x81/0x269
 [<ffffffff80059102>] ip_append_data+0x697/0xa43
 [<ffffffff800322f2>] ip_output+0x2ae/0x2dd
 [<ffffffff8025a9c5>] ip_push_pending_frames+0x381/0x461
 [<ffffffff8025aca4>] ip_send_reply+0x1ff/0x25b
 [<ffffffff80266a14>] tcp_v4_send_reset+0x104/0x141
 [<ffffffff80037d93>] ip_route_input+0xb2f/0xcbe
 [<ffffffff80027978>] tcp_v4_rcv+0xa2c/0xa71
 [<ffffffff80034c37>] ip_local_deliver+0x19f/0x265
 [<ffffffff80035dc9>] ip_rcv+0x539/0x57c
 [<ffffffff8002108c>] netif_receive_skb+0x48c/0x4ae
 [<ffffffff881fc15d>] :vmxnet3:vmxnet3_rq_rx_complete+0xa0f/0xb7f
 [<ffffffff881fc33a>] :vmxnet3:vmxnet3_do_poll+0x6d/0x88
 [<ffffffff881fde52>] :vmxnet3:vmxnet3_netpoll+0x27/0x30
 [<ffffffff80246cfd>] netpoll_poll_dev+0x47/0x36c
 [<ffffffff802470fc>] netpoll_send_skb_on_dev+0xda/0xef
 [<ffffffff886ee0e1>] :netconsole:write_msg+0x49/0x60
 [<ffffffff800941f6>] __call_console_drivers+0x5b/0x69
 [<ffffffff80017329>] release_console_sem+0x13e/0x200
 [<ffffffff800949eb>] vprintk+0x2b2/0x317
 [<ffffffff80094aa2>] printk+0x52/0xbd
 [<ffffffff8006503a>] oops_begin+0x5e/0x65
 [<ffffffff80065263>] die_nmi+0x24/0xa3
 [<ffffffff80079832>] do_nmi_callback2+0x33/0x3c
 [<ffffffff8006561c>] default_do_nmi+0x94/0x22f
 [<ffffffff80065903>] do_nmi+0x43/0x61
 [<ffffffff80064ecf>] nmi+0x7f/0x88
 [<ffffffff8006bdfb>] default_idle+0x0/0x50
 [<ffffffff8006be24>] default_idle+0x29/0x50
 <<EOE>>  [<ffffffff80049614>] cpu_idle+0x95/0xb8
 [<ffffffff80471809>] start_kernel+0x220/0x225
 [<ffffffff8047122f>] _sinittext+0x22f/0x236
  • The server hung during FLIP activity.
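
The offlining loop described in the first item above can be sketched roughly as follows. This is a minimal illustration, not the exact script used in the observed instances: the function names offline_cpus and count_tx_timeouts, the SYSFS_CPU_ROOT override, and the CPU range in the usage note are all assumptions introduced here. The descending order mirrors the order in which CPUs go offline in the kernel messages above.

```shell
#!/bin/sh
# Hypothetical override so the loop can be dry-run against a scratch
# directory instead of the real sysfs tree.
SYSFS_CPU_ROOT="${SYSFS_CPU_ROOT:-/sys/devices/system/cpu}"

# offline_cpus FIRST LAST
# Writes 0 to cpuN/online for N = LAST down to FIRST, back to back,
# with no delay between writes.
offline_cpus() {
    first=$1
    cpu=$2
    while [ "$cpu" -ge "$first" ]; do
        ctl="$SYSFS_CPU_ROOT/cpu$cpu/online"
        # Skip CPUs without a writable control file
        # (for example, cpu0 on many x86 systems).
        if [ -w "$ctl" ]; then
            echo 0 > "$ctl"
        fi
        cpu=$((cpu - 1))
    done
}

# count_tx_timeouts IFACE
# Reads kernel log text on stdin and prints how many
# "NETDEV WATCHDOG" transmit timeout lines mention IFACE.
count_tx_timeouts() {
    grep -c "NETDEV WATCHDOG: $1: transmit timed out" || true
}
```

For example, running offline_cpus 4 23 as root would attempt to offline 20 CPUs (the range is illustrative; the CPUs involved in the observed instances are not stated), and dmesg | count_tx_timeouts eth0 afterwards prints the number of transmit timeouts logged for eth0.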

Environment

  • Red Hat Enterprise Linux 5.8
  • vmxnet3 VMware driver
  • VMware ESXi 5.0.0 build-721882
  • Hardware model: HP BL460c Gen8
