Red Hat Enterprise Linux system reboots after a crash with PANIC: "kernel BUG at net/ipv4/tcp_output.c:781!"
Environment
- Red Hat Enterprise Linux 6
- Cavium Octeon CN57XX Network Processor (CN54XX/CN55XX/CN56XX) (rev 09)
Issue
- The system frequently crashes and reboots on its own.
- The vmcore-dmesg.txt file shows:
<4>[29525.375284] ------------[ cut here ]------------
<2>[29525.375366] kernel BUG at net/ipv4/tcp_output.c:781!
<4>[29525.375440] invalid opcode: 0000 [#1] SMP
<4>[29525.375512] last sysfs file: /sys/devices/system/cpu/online
<4>[29525.375595] CPU 1
<4>[29525.375625] Modules linked in: aqsa_drv(U) octeon_drv(P)(U) autofs4 cpufreq_ondemand freq_table pcc_cpufreq bonding ipv6 uinput sg power_meter acpi_ipmi microcode serio_raw iTCO_wdt iTCO_vendor_support ipmi_si ipmi_msghandler hpilo hpwdt nx_nic(U) lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom hpsa pata_acpi ata_generic ata_piix radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: octeon_drv]
<4>[29525.376573]
<4>[29525.376596] Pid: 0, comm: swapper Tainted: P -- ------------ 2.6.32-573.el6.x86_64 #1 HP ProLiant DL370 G6
<4>[29525.376765] RIP: 0010:[<ffffffff814c0cab>] [<ffffffff814c0cab>] tcp_transmit_skb+0x74b/0x8b0
<4>[29525.376902] RSP: 0018:ffff8802bd803d00 EFLAGS: 00010246
<4>[29525.376981] RAX: 0000000000000140 RBX: ffff880432d13240 RCX: 0000000000000020
<4>[29525.377086] RDX: ffff88032202f400 RSI: ffff880322190480 RDI: ffff880432d13240
<4>[29525.377228] RBP: ffff8802bd803d70 R08: 0000000000000000 R09: 00000000a28e07c9
<4>[29525.377333] R10: 00000000640a190a R11: 0000000000000000 R12: ffff880322190480
<4>[29525.377437] R13: 0000000000000001 R14: 0000000000000218 R15: 0000000000000000
<4>[29525.377541] FS: 0000000000000000(0000) GS:ffff8802bd800000(0000) knlGS:0000000000000000
<4>[29525.377660] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>[29525.377744] CR2: 00007f0b06059000 CR3: 0000000001a8d000 CR4: 00000000000007e0
<4>[29525.377848] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[29525.377951] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[29525.378055] Process swapper (pid: 0, threadinfo ffff8802b459c000, task ffff8802b4599520)
<4>[29525.378174] Stack:
<4>[29525.378205] ffff8802bd803d30 ffffffff814c01a3 ffff880432d13240 000000000000020c
<4>[29525.378361] <d> ffff8803221904b8 00000000ffffff8f ffff8802bd803d70 ffffffff814c027b
<4>[29525.378495] <d> ffff8802bd803d70 ffff880432d13240 ffff880322190480 ffff8803221904b8
<4>[29525.378634] Call Trace:
<4>[29525.378673] <IRQ>
<4>[29525.378711] [<ffffffff814c01a3>] ? tcp_established_options+0x43/0xd0
<4>[29525.378806] [<ffffffff814c027b>] ? tcp_current_mss+0x4b/0x70
<4>[29525.378891] [<ffffffff814c1ecd>] tcp_retransmit_skb+0x1dd/0x650
<4>[29525.378986] [<ffffffff810149c9>] ? sched_clock+0x9/0x10
<4>[29525.379067] [<ffffffff814c4f80>] ? tcp_write_timer+0x0/0x200
<4>[29525.379151] [<ffffffff814c4ae3>] tcp_retransmit_timer+0x1e3/0x680
<4>[29525.379243] [<ffffffff814c5118>] tcp_write_timer+0x198/0x200
<4>[29525.379349] [<ffffffff8108a4d7>] run_timer_softirq+0x197/0x340
<4>[29525.379458] [<ffffffff8103574d>] ? lapic_next_event+0x1d/0x30
<4>[29525.379550] [<ffffffff8107ffd1>] __do_softirq+0xc1/0x1e0
<4>[29525.379636] [<ffffffff810b2e9f>] ? tick_program_event+0x2f/0x40
<4>[29525.379729] [<ffffffff8100c38c>] call_softirq+0x1c/0x30
<4>[29525.379809] [<ffffffff8100fbd5>] do_softirq+0x65/0xa0
<4>[29525.379888] [<ffffffff8107fe85>] irq_exit+0x85/0x90
<4>[29525.384092] [<ffffffff815423da>] smp_apic_timer_interrupt+0x4a/0x60
<4>[29525.388323] [<ffffffff8100bc13>] apic_timer_interrupt+0x13/0x20
<4>[29525.392546] <EOI>
<4>[29525.396665] [<ffffffff812f0c4e>] ? intel_idle+0xfe/0x1b0
<4>[29525.400866] [<ffffffff812f0c31>] ? intel_idle+0xe1/0x1b0
<4>[29525.404992] [<ffffffff810149c9>] ? sched_clock+0x9/0x10
<4>[29525.409137] [<ffffffff810a89ad>] ? sched_clock_cpu+0xcd/0x110
<4>[29525.413204] [<ffffffff8143331a>] cpuidle_idle_call+0x7a/0xe0
<4>[29525.417207] [<ffffffff81009fe6>] cpu_idle+0xb6/0x110
<4>[29525.421205] [<ffffffff81531762>] start_secondary+0x2c0/0x316
<4>[29525.425080] Code: 84 24 d0 00 00 00 66 83 48 0a 08 0f b6 83 06 05 00 00 e9 60 fc ff ff 66 0f 1f 84 00 00 00 00 00 8b 8b 30 05 00 00 e9 50 ff ff ff <0f> 0b eb fe be 01 00 00 00 48 89 df 89 45 a0 e8 a1 9c ff ff 8b
<1>[29525.433424] RIP [<ffffffff814c0cab>] tcp_transmit_skb+0x74b/0x8b0
<4>[29525.437365] RSP <ffff8802bd803d00>
Resolution
- The octeon_drv module is not provided by Red Hat and is therefore unsupported. Please contact your hardware vendor for support.
- A possible workaround is to disable the GSO (Generic Segmentation Offload) feature on all Network Interface Cards (NICs) in use:
# ethtool -K ethX gso off
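The workaround has to be repeated for each NIC. A minimal sketch of how that could be scripted is shown below. The function names list_physical_nics and disable_gso_all are hypothetical, and the sketch assumes that a physical NIC is any interface with a "device" entry under /sys/class/net, so virtual interfaces such as lo and bonds are skipped. Whether those also need the setting is not covered here.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any Red Hat tool.

# Print the names of interfaces backed by a hardware device.
# $1: sysfs network class directory (defaults to /sys/class/net).
list_physical_nics() {
    sysfs_net="${1:-/sys/class/net}"
    for path in "$sysfs_net"/*; do
        # Assumption: virtual interfaces (lo, bonds) have no "device" entry.
        [ -e "$path/device" ] || continue
        basename "$path"
    done
}

# Turn GSO off on each physical NIC, then print the resulting GSO state.
disable_gso_all() {
    for nic in $(list_physical_nics "$@"); do
        ethtool -K "$nic" gso off
        ethtool -k "$nic" | grep 'generic-segmentation-offload'
    done
}
```

After sourcing the file, running disable_gso_all as root applies the change. Settings made with ethtool -K do not survive a reboot, so they would need to be reapplied at boot.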
Root Cause
- The fault occurs in the tcp_transmit_skb() function due to invalid GSO segmentation information in the shared info of the skb.
Diagnostic Steps
- Collect an sosreport of the system.
- Collect a vmcore at the time of the crash.
- From the sosreport:
$ grep -i 'Octeon' lspci
0a:00.0 MIPS: Cavium, Inc. Octeon CN57XX Network Processor (CN54XX/CN55XX/CN56XX) (rev 09)
1e:00.0 MIPS: Cavium, Inc. Octeon CN57XX Network Processor (CN54XX/CN55XX/CN56XX) (rev 09)
- From the modinfo output (sos_commands/kernel/modinfo_*):
filename: /lib/modules/2.6.32-573.el6.x86_64/octeon_drv.ko
license: Cavium Networks
description: Octeon Host PCI Driver
author: Cavium Networks
srcversion: 54522CF688BF11C3E783F78
depends:
vermagic: 2.6.32-431.el6.x86_64 SMP mod_unload modversions
parm: octeon_msi:Flag for enabling MSI interrupts (int)
filename: /lib/modules/2.6.32-573.el6.x86_64/aqsa_drv.ko
license: GPL
author: AQSACOM SAS
srcversion: 0A344EDC9E18F22E7B8A029
depends: octeon_drv
vermagic: 2.6.32-431.el6.x86_64 SMP mod_unload modversions
From vmcore analysis we have:
- System Information
crash> sys
KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-573.el6.x86_64/vmlinux
DUMPFILE: /cores/retrace/tasks/425259960/crash/vmcore [PARTIAL DUMP]
CPUS: 8
DATE: Wed Feb 24 14:34:11 2016
UPTIME: 08:12:13
LOAD AVERAGE: 0.00, 0.00, 0.00
TASKS: 899
NODENAME: <nodename>
RELEASE: 2.6.32-573.el6.x86_64
VERSION: #1 SMP Wed Jul 1 18:23:37 EDT 2015
MACHINE: x86_64 (1999 Mhz)
MEMORY: 16 GB
PANIC: "[29525.375366] kernel BUG at net/ipv4/tcp_output.c:781!"
- The backtrace of the panic task shows the RIP in the tcp_transmit_skb() function:
crash> bt
PID: 0 TASK: ffff8802b4599520 CPU: 1 COMMAND: "swapper"
#0 [ffff8802bd8039c0] machine_kexec at ffffffff8103d1ab
#1 [ffff8802bd803a20] crash_kexec at ffffffff810cc4f2
#2 [ffff8802bd803af0] oops_end at ffffffff8153c840
#3 [ffff8802bd803b20] die at ffffffff81010f5b
#4 [ffff8802bd803b50] do_trap at ffffffff8153c094
#5 [ffff8802bd803bb0] do_invalid_op at ffffffff8100cf55
#6 [ffff8802bd803c50] invalid_op at ffffffff8100c01b
[exception RIP: tcp_transmit_skb+1867] <<----- Panic
RIP: ffffffff814c0cab RSP: ffff8802bd803d00 RFLAGS: 00010246
RAX: 0000000000000140 RBX: ffff880432d13240 RCX: 0000000000000020
RDX: ffff88032202f400 RSI: ffff880322190480 RDI: ffff880432d13240
RBP: ffff8802bd803d70 R8: 0000000000000000 R9: 00000000a28e07c9
R10: 00000000640a190a R11: 0000000000000000 R12: ffff880322190480
R13: 0000000000000001 R14: 0000000000000218 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff8802bd803d08] tcp_established_options at ffffffff814c01a3
#8 [ffff8802bd803d38] tcp_current_mss at ffffffff814c027b
#9 [ffff8802bd803d78] tcp_retransmit_skb at ffffffff814c1ecd
#10 [ffff8802bd803de8] tcp_retransmit_timer at ffffffff814c4ae3
#11 [ffff8802bd803e18] tcp_write_timer at ffffffff814c5118
#12 [ffff8802bd803e48] run_timer_softirq at ffffffff8108a4d7
#13 [ffff8802bd803ed8] __do_softirq at ffffffff8107ffd1
#14 [ffff8802bd803f48] call_softirq at ffffffff8100c38c
#15 [ffff8802bd803f60] do_softirq at ffffffff8100fbd5
#16 [ffff8802bd803f80] irq_exit at ffffffff8107fe85
#17 [ffff8802bd803f90] smp_apic_timer_interrupt at ffffffff815423da
#18 [ffff8802bd803fb0] apic_timer_interrupt at ffffffff8100bc13
--- <IRQ stack> ---
#19 [ffff8802b459fd98] apic_timer_interrupt at ffffffff8100bc13
[exception RIP: intel_idle+254]
RIP: ffffffff812f0c4e RSP: ffff8802b459fe48 RFLAGS: 00000206
RAX: 0000000000000000 RBX: ffff8802b459fed8 RCX: 0000000000000000
RDX: 00000000000003a9 RSI: 0000000000000000 RDI: 00000000000e4da6
RBP: ffffffff8100bc0e R8: 0000000000000000 R9: 00000000000000c8
R10: 0000000000000002 R11: 0000000000000000 R12: ffffffff8153e845
R13: ffff8802b459fde8 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#20 [ffff8802b459fee0] cpuidle_idle_call at ffffffff8143331a
#21 [ffff8802b459ff00] cpu_idle at ffffffff81009fe6
- Third-party kernel modules loaded:
crash> mod -t
NAME TAINTS
nx_nic (U)
aqsa_drv (U)
octeon_drv P(U)
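When only vmcore-dmesg.txt is available and the vmcore cannot be opened in crash, the same list can be approximated from the "Modules linked in:" line of the log. The sketch below assumes that any module name followed by a parenthesised flag carries a taint; the function name list_tainted_modules is hypothetical.

```shell
#!/bin/sh
# Sketch only: list modules that carry taint flags, such as "(U)" or "(P)(U)",
# in the first "Modules linked in:" line of a kernel log file.
# $1: path to the log, for example vmcore-dmesg.txt.
list_tainted_modules() {
    # Keep only the module list: drop the log prefix and the
    # trailing "[last unloaded: ...]" note, then stop after the first match.
    sed -n '/Modules linked in:/{s/.*Modules linked in: *//;s/ *\[last unloaded:.*//;p;q;}' "$1" \
        | tr ' ' '\n' \
        | grep '('
}
```

Run against the log in the Issue section, list_tainted_modules vmcore-dmesg.txt would be expected to print aqsa_drv(U), octeon_drv(P)(U) and nx_nic(U), one per line, matching the mod -t output above.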
