The kernel crashes with a hard lockup while the tvec_base timer value lags far behind the current jiffies value on all CPUs.
Issue
- The kernel crashes due to a hard lockup.
[184510.191400] Watchdog detected hard LOCKUP on cpu 22
[184510.191490] Modules linked in: iptable_filter ip_tables ipmi_watchdog ipmi_devintf nfnetlink_queue nfnetlink_log nfnetlink bluetooth rfkill ktap_90265(U) oracleacfs(P)(U) oracleadvm(P)(U) oracleoks(P)(U) mptctl mptbase nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding ipv6 microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support ses enclosure bnx2 serio_raw netxen_nic lpc_ich mfd_core hpilo hpwdt sg i7core_edac edac_core shpchp ext4 jbd2 mbcache dm_round_robin scsi_dh_alua sr_mod cdrom sd_mod lpfc scsi_transport_fc scsi_tgt crc_t10dif pata_acpi ata_generic ata_piix hpsa radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[184510.214445] Pid: 14564, comm: mdnsd.bin Tainted: P -- ------------ 2.6.32-754.39.1.el6.x86_64 #1
[184510.218058] Call Trace:
[184510.221601] <NMI> [<ffffffff810f8911>] ? watchdog_overflow_callback+0xf1/0x110
[184510.225272] [<ffffffff81131e4a>] ? __perf_event_overflow+0xaa/0x240
[184510.228931] [<ffffffff8101ef88>] ? x86_perf_event_set_period+0xf8/0x180
[184510.232626] [<ffffffff811324a4>] ? perf_event_overflow+0x14/0x20
[184510.236364] [<ffffffff810264ec>] ? intel_pmu_handle_irq+0x21c/0x4a0
[184510.240093] [<ffffffff8156086f>] ? perf_event_nmi_handler+0x3f/0xb0
[184510.243779] [<ffffffff81562380>] ? notifier_call_chain+0x50/0x80
[184510.247424] [<ffffffff815623ea>] ? atomic_notifier_call_chain+0x1a/0x20
[184510.251075] [<ffffffff810b261e>] ? notify_die+0x2e/0x30
[184510.254723] [<ffffffff8155fed9>] ? do_nmi+0xd9/0x360
[184510.258357] [<ffffffff8155f7a4>] ? nmi+0x134/0x1a3
[184510.261960] [<ffffffff8155def2>] ? _spin_lock_irqsave+0x32/0x40
[184510.265601] <<EOE>> [<ffffffff810b0861>] ? lock_hrtimer_base+0x31/0x60
[184510.269334] [<ffffffff810b1532>] ? hrtimer_try_to_cancel+0x22/0xd0
[184510.273064] [<ffffffff810b1602>] ? hrtimer_cancel+0x22/0x30
[184510.276735] [<ffffffff8155d0a0>] ? schedule_hrtimeout_range+0xd0/0x160
[184510.280420] [<ffffffff810b04c0>] ? hrtimer_wakeup+0x0/0x30
[184510.284078] [<ffffffff810b1504>] ? hrtimer_start_range_ns+0x14/0x20
[184510.287732] [<ffffffff811baad9>] ? poll_schedule_timeout+0x39/0x60
[184510.291370] [<ffffffff811bb3f5>] ? do_select+0x5d5/0x7c0
[184510.295023] [<ffffffff811babd0>] ? __pollwait+0x0/0xf0
[184510.298703] [<ffffffff811bacc0>] ? pollwake+0x0/0x60
[184510.302348] [<ffffffff811bacc0>] ? pollwake+0x0/0x60
[184510.305914] [<ffffffff811bacc0>] ? pollwake+0x0/0x60
[184510.309424] [<ffffffff811bacc0>] ? pollwake+0x0/0x60
[184510.312905] [<ffffffff811bacc0>] ? pollwake+0x0/0x60
[184510.316341] [<ffffffff811bacc0>] ? pollwake+0x0/0x60
[184510.319718] [<ffffffff811bacc0>] ? pollwake+0x0/0x60
[184510.323077] [<ffffffff811bacc0>] ? pollwake+0x0/0x60
[184510.326434] [<ffffffff811bacc0>] ? pollwake+0x0/0x60
[184510.329730] [<ffffffff811bc02a>] ? core_sys_select+0x18a/0x2c0
[184510.333006] [<ffffffff81474dbc>] ? __sys_recvmsg+0x16c/0x2e0
[184510.336265] [<ffffffff81013539>] ? read_tsc+0x9/0x20
[184510.339494] [<ffffffff810b8394>] ? ktime_get_ts+0xc4/0x100
[184510.342700] [<ffffffff81013539>] ? read_tsc+0x9/0x20
[184510.345880] [<ffffffff810b8394>] ? ktime_get_ts+0xc4/0x100
[184510.349077] [<ffffffff811bc3b7>] ? sys_select+0x47/0x110
[184510.352309] [<ffffffff81566391>] ? system_call_fastpath+0x1f/0x3a
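The NMI frames at the top of this trace (perf_event_nmi_handler calling watchdog_overflow_callback) are the hard-lockup detector firing. As a rough mental model only, assuming the RHEL 6 backport follows the upstream kernel/watchdog.c logic, the check compares a per-CPU count of watchdog hrtimer interrupts with the value saved at the previous NMI: if the count has not moved, the CPU has not serviced a timer interrupt for a whole NMI period. The class and method names below are hypothetical and do not reproduce the actual kernel code.

```python
class CpuWatchdogModel:
    """Simplified, hypothetical model of a per-CPU hard-lockup check."""

    def __init__(self) -> None:
        self.hrtimer_interrupts = 0        # bumped by the periodic watchdog hrtimer
        self.hrtimer_interrupts_saved = 0  # snapshot taken at the previous NMI

    def timer_tick(self) -> None:
        # Only runs when interrupts are enabled on this CPU.
        self.hrtimer_interrupts += 1

    def nmi_check(self) -> bool:
        """Return True if no timer interrupt was serviced since the last NMI."""
        current = self.hrtimer_interrupts
        if current == self.hrtimer_interrupts_saved:
            return True  # would be reported as "Watchdog detected hard LOCKUP"
        self.hrtimer_interrupts_saved = current
        return False


cpu22 = CpuWatchdogModel()

# Healthy period: a timer interrupt arrives between two NMIs.
cpu22.timer_tick()
print(cpu22.nmi_check())  # False

# CPU spins in _spin_lock_irqsave with interrupts off, so no ticks arrive.
print(cpu22.nmi_check())  # True
```

In this model a CPU that keeps interrupts disabled is flagged no matter what it is executing, which is why the report names the spinning CPU rather than the lock holder.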
- The task on CPU 22 is spinning on a spinlock in kernel mode with interrupts disabled (_spin_lock_irqsave) and has not given other tasks a chance to run for more than 60 seconds.
PID: 14564 TASK: ffff88386032d520 CPU: 22 COMMAND: "mdnsd.bin"
...
[exception RIP: _spin_lock_irqsave+50]
RIP: ffffffff8155def2 RSP: ffff8838634ab808 RFLAGS: 00000097
RAX: 000000000000fc5a RBX: ffff8820b8915ac8 RCX: 000000000000fc59
RDX: 0000000000000282 RSI: ffff8838634ab860 RDI: ffff8820b8915a80
RBP: ffff8838634ab808 R8: ffff8838634a8000 R9: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8838634ab8a8
R13: ffff8838634ab860 R14: 00000000004c4b3f R15: 0000000000000040
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#12 [ffff8838634ab808] _spin_lock_irqsave at ffffffff8155def2
#13 [ffff8838634ab810] lock_hrtimer_base at ffffffff810b0861
#14 [ffff8838634ab840] hrtimer_try_to_cancel at ffffffff810b1532
...
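The RFLAGS value in the register dump above is consistent with this: on x86, bit 9 (IF, mask 0x200) is the interrupt-enable flag, and it is clear in 0x97, so CPU 22 was spinning with interrupts disabled. A small decoder for the standard RFLAGS bits is sketched below; the helper names are illustrative and only a subset of flags is listed.

```python
# Standard x86 RFLAGS bit positions (subset; reserved bit 1 omitted).
FLAGS = {
    0: "CF",   # carry
    2: "PF",   # parity
    4: "AF",   # auxiliary carry
    6: "ZF",   # zero
    7: "SF",   # sign
    8: "TF",   # trap
    9: "IF",   # interrupt enable
    10: "DF",  # direction
    11: "OF",  # overflow
}


def decode_rflags(value: int) -> list[str]:
    """Return the names of the listed flags that are set, in bit order."""
    return [name for bit, name in sorted(FLAGS.items()) if value & (1 << bit)]


def interrupts_enabled(value: int) -> bool:
    """True if the IF bit (bit 9) is set."""
    return bool(value & (1 << 9))


rflags_cpu22 = 0x97  # RFLAGS from the CPU 22 register dump
print(decode_rflags(rflags_cpu22))       # ['CF', 'PF', 'AF', 'SF']
print(interrupts_enabled(rflags_cpu22))  # False
```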
- At the same time, the idle task (swapper) on CPU 24 is running the CPU idle routine while holding that spinlock.
PID: 0 TASK: ffff8810693a0ab0 CPU: 24 COMMAND: "swapper"
...
[exception RIP: mwait_idle_with_hints+196]
RIP: ffffffff810156d4 RSP: ffff8810693abdf8 RFLAGS: 00000046
RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff8810693abfd8 RDI: 0000000000000010
RBP: ffff8810693abe28 R8: 0000000000000003 R9: 0000000000000040
R10: 0000a624f52a6af8 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000018 R14: 0000000000000002 R15: 16a331373037b949
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#3 [ffff8810693abdf8] mwait_idle_with_hints at ffffffff810156d4
#4 [ffff8810693abe60] hrtimer_try_to_cancel at ffffffff810b15ab
#5 [ffff8810693abea0] hrtimer_cancel at ffffffff810b1602
#6 [ffff8810693abec0] tick_nohz_restart_sched_tick at ffffffff810beb7b
#7 [ffff8810693abef0] cpu_idle at ffffffff8100a1ad
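The summary states that the tvec_base value trails the current jiffies on every CPU. Assuming the 2.6.32 timer wheel layout, where struct tvec_base records in timer_jiffies how far the wheel has been processed, the lag is jiffies minus timer_jiffies, computed with unsigned wraparound in the same way as the kernel's time_after() macro. The sketch below assumes a 64-bit unsigned long and HZ=1000 (check CONFIG_HZ for the running kernel); the jiffies values are placeholders, not taken from this vmcore, and the helper names are hypothetical.

```python
BITS = 64                 # assumes unsigned long is 64 bits (x86_64)
MASK = (1 << BITS) - 1


def to_signed(value: int) -> int:
    """Interpret a BITS-wide unsigned value as two's-complement signed."""
    value &= MASK
    return value - (1 << BITS) if value >> (BITS - 1) else value


def time_after(a: int, b: int) -> bool:
    """Model of time_after(a, b): ((long)((b) - (a)) < 0)."""
    return to_signed(b - a) < 0


def timer_wheel_lag(jiffies: int, timer_jiffies: int, hz: int = 1000) -> float:
    """Seconds by which the wheel's timer_jiffies trails jiffies."""
    return to_signed(jiffies - timer_jiffies) / hz


# Placeholder values for illustration only (not from this crash dump).
jiffies = 4_479_000_000
timer_jiffies = 4_478_850_000
print(time_after(jiffies, timer_jiffies))       # True
print(timer_wheel_lag(jiffies, timer_jiffies))  # 150.0
```

Doing the subtraction modulo 2^64 keeps the result correct even if jiffies has wrapped past timer_jiffies.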
Environment
- Red Hat Enterprise Linux 6.10.z kernel-2.6.32-754.39.1.el6
- HPE ProLiant DL580 G7