Red Hat Enterprise Linux system reboots after a crash with PANIC: "kernel BUG at net/ipv4/tcp_output.c:781!"

Solution Verified

Environment

  • Red Hat Enterprise Linux 6
  • Cavium Octeon CN57XX Network Processor (CN54XX/CN55XX/CN56XX) (rev 09)

Issue

  • The system frequently crashes and reboots on its own.
  • The vmcore-dmesg.txt file shows:
<4>[29525.375284] ------------[ cut here ]------------
<2>[29525.375366] kernel BUG at net/ipv4/tcp_output.c:781!
<4>[29525.375440] invalid opcode: 0000 [#1] SMP
<4>[29525.375512] last sysfs file: /sys/devices/system/cpu/online
<4>[29525.375595] CPU 1 
<4>[29525.375625] Modules linked in: aqsa_drv(U) octeon_drv(P)(U) autofs4 cpufreq_ondemand freq_table pcc_cpufreq bonding ipv6 uinput sg power_meter acpi_ipmi microcode serio_raw iTCO_wdt iTCO_vendor_support ipmi_si ipmi_msghandler hpilo hpwdt nx_nic(U) lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom hpsa pata_acpi ata_generic ata_piix radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: octeon_drv]
<4>[29525.376573]  
<4>[29525.376596] Pid: 0, comm: swapper Tainted: P           -- ------------    2.6.32-573.el6.x86_64 #1 HP ProLiant DL370 G6
<4>[29525.376765] RIP: 0010:[<ffffffff814c0cab>]  [<ffffffff814c0cab>] tcp_transmit_skb+0x74b/0x8b0
<4>[29525.376902] RSP: 0018:ffff8802bd803d00  EFLAGS: 00010246
<4>[29525.376981] RAX: 0000000000000140 RBX: ffff880432d13240 RCX: 0000000000000020
<4>[29525.377086] RDX: ffff88032202f400 RSI: ffff880322190480 RDI: ffff880432d13240
<4>[29525.377228] RBP: ffff8802bd803d70 R08: 0000000000000000 R09: 00000000a28e07c9
<4>[29525.377333] R10: 00000000640a190a R11: 0000000000000000 R12: ffff880322190480
<4>[29525.377437] R13: 0000000000000001 R14: 0000000000000218 R15: 0000000000000000
<4>[29525.377541] FS:  0000000000000000(0000) GS:ffff8802bd800000(0000) knlGS:0000000000000000
<4>[29525.377660] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>[29525.377744] CR2: 00007f0b06059000 CR3: 0000000001a8d000 CR4: 00000000000007e0
<4>[29525.377848] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[29525.377951] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[29525.378055] Process swapper (pid: 0, threadinfo ffff8802b459c000, task ffff8802b4599520)
<4>[29525.378174] Stack:
<4>[29525.378205]  ffff8802bd803d30 ffffffff814c01a3 ffff880432d13240 000000000000020c
<4>[29525.378361] <d> ffff8803221904b8 00000000ffffff8f ffff8802bd803d70 ffffffff814c027b
<4>[29525.378495] <d> ffff8802bd803d70 ffff880432d13240 ffff880322190480 ffff8803221904b8
<4>[29525.378634] Call Trace:
<4>[29525.378673]  <IRQ>
<4>[29525.378711]  [<ffffffff814c01a3>] ? tcp_established_options+0x43/0xd0
<4>[29525.378806]  [<ffffffff814c027b>] ? tcp_current_mss+0x4b/0x70
<4>[29525.378891]  [<ffffffff814c1ecd>] tcp_retransmit_skb+0x1dd/0x650
<4>[29525.378986]  [<ffffffff810149c9>] ? sched_clock+0x9/0x10
<4>[29525.379067]  [<ffffffff814c4f80>] ? tcp_write_timer+0x0/0x200
<4>[29525.379151]  [<ffffffff814c4ae3>] tcp_retransmit_timer+0x1e3/0x680
<4>[29525.379243]  [<ffffffff814c5118>] tcp_write_timer+0x198/0x200
<4>[29525.379349]  [<ffffffff8108a4d7>] run_timer_softirq+0x197/0x340
<4>[29525.379458]  [<ffffffff8103574d>] ? lapic_next_event+0x1d/0x30
<4>[29525.379550]  [<ffffffff8107ffd1>] __do_softirq+0xc1/0x1e0
<4>[29525.379636]  [<ffffffff810b2e9f>] ? tick_program_event+0x2f/0x40
<4>[29525.379729]  [<ffffffff8100c38c>] call_softirq+0x1c/0x30
<4>[29525.379809]  [<ffffffff8100fbd5>] do_softirq+0x65/0xa0
<4>[29525.379888]  [<ffffffff8107fe85>] irq_exit+0x85/0x90
<4>[29525.384092]  [<ffffffff815423da>] smp_apic_timer_interrupt+0x4a/0x60
<4>[29525.388323]  [<ffffffff8100bc13>] apic_timer_interrupt+0x13/0x20
<4>[29525.392546]  <EOI>
<4>[29525.396665]  [<ffffffff812f0c4e>] ? intel_idle+0xfe/0x1b0
<4>[29525.400866]  [<ffffffff812f0c31>] ? intel_idle+0xe1/0x1b0
<4>[29525.404992]  [<ffffffff810149c9>] ? sched_clock+0x9/0x10
<4>[29525.409137]  [<ffffffff810a89ad>] ? sched_clock_cpu+0xcd/0x110
<4>[29525.413204]  [<ffffffff8143331a>] cpuidle_idle_call+0x7a/0xe0
<4>[29525.417207]  [<ffffffff81009fe6>] cpu_idle+0xb6/0x110
<4>[29525.421205]  [<ffffffff81531762>] start_secondary+0x2c0/0x316
<4>[29525.425080] Code: 84 24 d0 00 00 00 66 83 48 0a 08 0f b6 83 06 05 00 00 e9 60 fc ff ff 66 0f 1f 84 00 00 00 00 00 8b 8b 30 05 00 00 e9 50 ff ff ff <0f> 0b eb fe be 01 00 00 00 48 89 df 89 45 a0 e8 a1 9c ff ff 8b
<1>[29525.433424] RIP  [<ffffffff814c0cab>] tcp_transmit_skb+0x74b/0x8b0
<4>[29525.437365]  RSP <ffff8802bd803d00>
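
The two lines that matter most in the log above are the BUG location and the faulting RIP. As a minimal sketch, assuming the log is saved as a vmcore-dmesg.txt style file, a small helper can pull them out. The function name summarize_oops is hypothetical and only used for illustration.

```shell
#!/bin/bash
# Print the BUG location and the faulting RIP symbol from a kernel log.
# $1: path to a vmcore-dmesg.txt style file (the path is an assumption).
summarize_oops() {
    local log="$1"
    # "kernel BUG at <file>:<line>!" becomes "bug_at=<file>:<line>"
    sed -n 's/.*kernel BUG at \([^!]*\)!.*/bug_at=\1/p' "$log" | head -n 1
    # "RIP: 0010:[<addr>]  [<addr>] symbol+0xoff/0xlen" becomes "rip=symbol+0xoff/0xlen"
    sed -n 's/.*RIP: [0-9a-f]*:\[<[0-9a-f]*>] *\[<[0-9a-f]*>] *\(.*\)$/rip=\1/p' "$log" | head -n 1
}
```

Calling summarize_oops with the path to the log prints one bug_at= line and one rip= line, which is usually enough to search for a matching known issue.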

Resolution

  • The octeon_drv module is not provided by Red Hat and is therefore unsupported. Contact your hardware vendor for support.
  • A possible workaround is to disable the GSO (Generic Segmentation Offload) feature on all Network Interface Cards (NICs) in use:
# ethtool -K ethX gso off
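
Because the workaround applies to every NIC in use, it can be convenient to loop over the interfaces instead of typing the command once per device. The following is a sketch under stated assumptions: the helper names list_nics and disable_gso_all and the DRY_RUN variable are hypothetical, and interfaces are assumed to appear as directories under /sys/class/net.

```shell
#!/bin/bash
# List network interfaces from a sysfs-style directory, skipping loopback.
# $1: directory to scan (defaults to /sys/class/net).
list_nics() {
    local dir="${1:-/sys/class/net}" path name
    for path in "$dir"/*; do
        # Interfaces are directories; plain files such as bonding_masters are skipped.
        [ -d "$path" ] || continue
        name="${path##*/}"
        [ "$name" = "lo" ] && continue
        printf '%s\n' "$name"
    done
}

# Turn GSO off on every listed interface, then show the resulting state.
# Set DRY_RUN=1 to print the commands without running them.
disable_gso_all() {
    local nic
    for nic in $(list_nics "$1"); do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ethtool -K $nic gso off"
        else
            ethtool -K "$nic" gso off
            ethtool -k "$nic" | grep 'generic-segmentation-offload'
        fi
    done
}
```

Running DRY_RUN=1 disable_gso_all first shows which interfaces would be changed. Settings applied with ethtool -K are lost at reboot; on Red Hat Enterprise Linux 6 the ETHTOOL_OPTS variable in the interface's ifcfg file is one commonly used way to reapply them, but verify that against the initscripts version installed.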

Root Cause

  • The BUG triggers in the tcp_transmit_skb() function while a segment is being retransmitted: the GSO segmentation data read from the skb's shared info (skb_shinfo) is invalid.

Diagnostic Steps

  • Collect an sosreport from the system.
  • Collect the vmcore captured at the time of the crash.
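
A vmcore is only written if kdump was set up before the crash. As a minimal sketch, the check below tests whether the running kernel was booted with a crashkernel= reservation; the function name has_crashkernel is hypothetical, and the commands in the comments assume a default Red Hat Enterprise Linux 6 setup.

```shell
#!/bin/bash
# Return 0 if the given kernel command line reserves memory for kdump.
# $1: file holding the command line (defaults to /proc/cmdline).
has_crashkernel() {
    grep -q 'crashkernel=' "${1:-/proc/cmdline}"
}

# Typical collection steps on the affected host (run as root):
#   sosreport              # gathers configuration and logs into an archive
#   service kdump status   # confirms the kdump service is operational
#   ls /var/crash          # default vmcore location, unless kdump.conf changes it
```

If has_crashkernel fails, no memory is reserved for the capture kernel and a crash will not produce a vmcore until kdump is configured and the system is rebooted.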

  • From sosreport:

$ grep -i 'Octeon' lspci 
0a:00.0 MIPS: Cavium, Inc. Octeon CN57XX Network Processor (CN54XX/CN55XX/CN56XX) (rev 09)
1e:00.0 MIPS: Cavium, Inc. Octeon CN57XX Network Processor (CN54XX/CN55XX/CN56XX) (rev 09)
  • From modinfo output (sos_commands/kernel/modinfo_*):
filename:       /lib/modules/2.6.32-573.el6.x86_64/octeon_drv.ko
license:        Cavium Networks
description:    Octeon Host PCI Driver
author:         Cavium Networks
srcversion:     54522CF688BF11C3E783F78
depends:
vermagic:       2.6.32-431.el6.x86_64 SMP mod_unload modversions
parm:           octeon_msi:Flag for enabling MSI interrupts (int)

filename:       /lib/modules/2.6.32-573.el6.x86_64/aqsa_drv.ko
license:        GPL
author:         AQSACOM SAS
srcversion:     0A344EDC9E18F22E7B8A029
depends:        octeon_drv
vermagic:       2.6.32-431.el6.x86_64 SMP mod_unload modversions
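
In the modinfo output above, both modules report a vermagic of 2.6.32-431.el6.x86_64 while the system runs 2.6.32-573.el6.x86_64. Whether that difference contributes to the crash is not established here, but it is worth passing on to the vendor. The sketch below lists such modules from saved modinfo output; the function name vermagic_mismatch is hypothetical.

```shell
#!/bin/bash
# Print modules whose vermagic kernel release differs from the given release.
# $1: file holding modinfo output
# $2: kernel release to compare against (for example the output of uname -r)
vermagic_mismatch() {
    awk -v rel="$2" '
        $1 == "filename:" { file = $2 }
        $1 == "vermagic:" && $2 != rel { print file " built for " $2 }
    ' "$1"
}
```

On a live system the input could be produced with modinfo octeon_drv aqsa_drv redirected to a file, and the second argument with "$(uname -r)".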

From vmcore analysis we have:

  • System Information
crash> sys
      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-573.el6.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/425259960/crash/vmcore  [PARTIAL DUMP]
        CPUS: 8
        DATE: Wed Feb 24 14:34:11 2016
      UPTIME: 08:12:13
LOAD AVERAGE: 0.00, 0.00, 0.00
       TASKS: 899
    NODENAME: <nodename>
     RELEASE: 2.6.32-573.el6.x86_64
     VERSION: #1 SMP Wed Jul 1 18:23:37 EDT 2015
     MACHINE: x86_64  (1999 Mhz)
      MEMORY: 16 GB
       PANIC: "[29525.375366] kernel BUG at net/ipv4/tcp_output.c:781!"
  • The backtrace of the panicking task shows the RIP in the tcp_transmit_skb() function:
crash> bt
PID: 0      TASK: ffff8802b4599520  CPU: 1   COMMAND: "swapper"
 #0 [ffff8802bd8039c0] machine_kexec at ffffffff8103d1ab
 #1 [ffff8802bd803a20] crash_kexec at ffffffff810cc4f2
 #2 [ffff8802bd803af0] oops_end at ffffffff8153c840
 #3 [ffff8802bd803b20] die at ffffffff81010f5b
 #4 [ffff8802bd803b50] do_trap at ffffffff8153c094
 #5 [ffff8802bd803bb0] do_invalid_op at ffffffff8100cf55
 #6 [ffff8802bd803c50] invalid_op at ffffffff8100c01b
    [exception RIP: tcp_transmit_skb+1867]                               <<----- Panic
    RIP: ffffffff814c0cab  RSP: ffff8802bd803d00  RFLAGS: 00010246
    RAX: 0000000000000140  RBX: ffff880432d13240  RCX: 0000000000000020
    RDX: ffff88032202f400  RSI: ffff880322190480  RDI: ffff880432d13240
    RBP: ffff8802bd803d70   R8: 0000000000000000   R9: 00000000a28e07c9
    R10: 00000000640a190a  R11: 0000000000000000  R12: ffff880322190480
    R13: 0000000000000001  R14: 0000000000000218  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff8802bd803d08] tcp_established_options at ffffffff814c01a3
 #8 [ffff8802bd803d38] tcp_current_mss at ffffffff814c027b
 #9 [ffff8802bd803d78] tcp_retransmit_skb at ffffffff814c1ecd
#10 [ffff8802bd803de8] tcp_retransmit_timer at ffffffff814c4ae3
#11 [ffff8802bd803e18] tcp_write_timer at ffffffff814c5118
#12 [ffff8802bd803e48] run_timer_softirq at ffffffff8108a4d7
#13 [ffff8802bd803ed8] __do_softirq at ffffffff8107ffd1
#14 [ffff8802bd803f48] call_softirq at ffffffff8100c38c
#15 [ffff8802bd803f60] do_softirq at ffffffff8100fbd5
#16 [ffff8802bd803f80] irq_exit at ffffffff8107fe85
#17 [ffff8802bd803f90] smp_apic_timer_interrupt at ffffffff815423da
#18 [ffff8802bd803fb0] apic_timer_interrupt at ffffffff8100bc13
--- <IRQ stack> ---
#19 [ffff8802b459fd98] apic_timer_interrupt at ffffffff8100bc13
    [exception RIP: intel_idle+254]
    RIP: ffffffff812f0c4e  RSP: ffff8802b459fe48  RFLAGS: 00000206
    RAX: 0000000000000000  RBX: ffff8802b459fed8  RCX: 0000000000000000
    RDX: 00000000000003a9  RSI: 0000000000000000  RDI: 00000000000e4da6
    RBP: ffffffff8100bc0e   R8: 0000000000000000   R9: 00000000000000c8
    R10: 0000000000000002  R11: 0000000000000000  R12: ffffffff8153e845
    R13: ffff8802b459fde8  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#20 [ffff8802b459fee0] cpuidle_idle_call at ffffffff8143331a
#21 [ffff8802b459ff00] cpu_idle at ffffffff81009fe6
  • Third-party kernel modules loaded:
crash> mod -t
NAME        TAINTS
nx_nic      (U)
aqsa_drv    (U)
octeon_drv  P(U)
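
When only the kernel log is available and no vmcore can be opened in crash, the same list can be approximated from the "Modules linked in:" line, where the flagged modules carry a marker such as (U) or (P). This is a sketch; the function name flagged_modules is hypothetical.

```shell
#!/bin/bash
# Print modules that carry a taint marker, such as (U) or (P), from the
# first "Modules linked in:" line of a kernel log.
# $1: path to the log file, for example vmcore-dmesg.txt
flagged_modules() {
    grep -m 1 'Modules linked in:' "$1" |
        sed 's/.*Modules linked in: //; s/ \[last unloaded.*//' |
        tr ' ' '\n' |
        grep -F '('
}
```

Each output line is one module name with its markers attached, which can then be checked against the vendor's driver list.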

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
