The server crashed with multiple hard lockups. Lots of MCE events were being logged before the crash.

Solution Verified - Updated -

Issue

  • The server crashed with multiple hard lockups. Many machine check exception (MCE) events were logged before the crash, followed by a CMCI storm, a soft lockup, and hard lockups on several CPUs:
[1780915.816854] mce: [Hardware Error]: Machine check events logged
[2176919.599198] mce: [Hardware Error]: Machine check events logged
[2829531.332234] mce: [Hardware Error]: Machine check events logged
[2919937.665378] mce: [Hardware Error]: Machine check events logged
[3038978.496527] mce: [Hardware Error]: Machine check events logged
[3323876.767276] mce: [Hardware Error]: Machine check events logged
[3647175.065549] mce: [Hardware Error]: Machine check events logged
[3731430.668298] mce: [Hardware Error]: Machine check events logged
[3886067.562527] mce: [Hardware Error]: Machine check events logged
[4473983.637762] mce: [Hardware Error]: Machine check events logged
[4567374.662809] mce: [Hardware Error]: Machine check events logged
[4923388.114969] mce: [Hardware Error]: Machine check events logged
[5000803.805808] mce: [Hardware Error]: Machine check events logged
[5518009.178445] mce: [Hardware Error]: Machine check events logged
[5856232.630047] mce: [Hardware Error]: Machine check events logged
[7221486.591468] mce: [Hardware Error]: Machine check events logged
[7221486.606332] mce: [Hardware Error]: Machine check events logged
[7825137.868388] mce: [Hardware Error]: Machine check events logged
[7917809.176188] mce: [Hardware Error]: Machine check events logged
[8157046.692590] mce: [Hardware Error]: Machine check events logged
[8611445.971314] mce: [Hardware Error]: Machine check events logged
[8752957.459324] mce: [Hardware Error]: Machine check events logged
[9135446.374342] mce: [Hardware Error]: Machine check events logged
[9723422.396207] mce: [Hardware Error]: Machine check events logged
[9798483.563618] mce: [Hardware Error]: Machine check events logged
[10162160.576292] mce: [Hardware Error]: Machine check events logged
[10162251.862879] mce: [Hardware Error]: Machine check events logged
[10162251.964029] mce: [Hardware Error]: Machine check events logged
[10162260.577124] CMCI storm detected: switching to poll mode
[10162260.734937] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.066 msecs
[10162260.736870] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.903 msecs
[10162260.738734] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 156.988 msecs
[10162260.738749] hrtimer: interrupt took 2800405 ns
[10162260.738752] perf: interrupt took too long (22821 > 16611), lowering kernel.perf_event_max_sample_rate to 8000
[10162262.720130] perf: interrupt took too long (40302 > 28526), lowering kernel.perf_event_max_sample_rate to 4000
[10162268.024932] sched: RT throttling activated
[10162277.031346] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1000.682 msecs
[10162278.032120] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 2001.366 msecs
[10162278.032128] perf: interrupt took too long (15651287 > 9794597), lowering kernel.perf_event_max_sample_rate to 1000
[10162278.032133] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 7005.029 msecs
[10162278.032136] perf: interrupt took too long (54735333 > 19564108), lowering kernel.perf_event_max_sample_rate to 1000
[10162287.038868] NMI watchdog: BUG: soft lockup - CPU#23 stuck for 23s! [titanagent:2336749]
[10162288.039250] Modules linked in: target_core_pscsi target_core_file target_core_iblock binfmt_misc tcp_diag inet_diag target_core_user uio rpcrdma ib_isert sunrpc iscsi_target_mod
[10162288.039252] NMI watchdog: Watchdog detected hard LOCKUP on cpu 32
[10162288.039318] Modules linked in: target_core_pscsi target_core_file target_core_iblock binfmt_misc tcp_diag inet_diag target_core_user uio rpcrdma ib_isert sunrpc iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm vfat fat intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper ocrdma(T) cdc_ether iTCO_wdt ib_core gpio_ich iTCO_vendor_support pcspkr cryptd usbnet lpc_ich mii sg ioatdma i2c_i801 i7core_edac ipmi_si ipmi_devintf pcc_cpufreq ipmi_msghandler acpi_cpufreq ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea ata_piix sysfillrect
[10162288.039336]  sysimgblt igb fb_sys_fops ptp ttm crct10dif_pclmul drm crct10dif_common pps_core crc32c_intel dca libata serio_raw megaraid_sas be2net bnx2 drm_panel_orientation_quirks i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod
[10162288.039341] CPU: 32 PID: 4476 Comm: tp_osd_tp Kdump: loaded Tainted: G          I    ------------ T 3.10.0-1062.21.1.el7.x86_64 #1
[10162288.039342] Hardware name: IBM System x3850 X5 -[7143SPM]-/Node 1, Processor Card, BIOS -[G0E173BUS-1.73]- 01/21/2012
[10162288.039344] task: ffff88b5dc66c1c0 ti: ffff88b5dbb68000 task.ti: ffff88b5dbb68000
[10162288.039352] RIP: 0010:[<ffffffff938d51b4>]  [<ffffffff938d51b4>] finish_task_switch+0x54/0x1c0
[10162288.039353] RSP: 0018:ffff88b5dbb6bc08  EFLAGS: 00000282
[10162288.039355] RAX: ffff88b5d6805230 RBX: ffff88b5dbb6bbb8 RCX: 00000000c0000100
[10162288.039356] RDX: ffff88b5dbb69fd8 RSI: ffff88b5dc66c1c0 RDI: ffff88a5dfc1ac80
[10162288.039358] RBP: ffff88b5dbb6bc28 R08: ffff88b5dbb68000 R09: 0000000000000001
[10162288.039359] R10: 0000000000000001 R11: ffff88bddd638a00 R12: ffff88b5dbb6bbb8
[10162288.039360] R13: ffffffff938e507c R14: ffff88b5dbb6bbb0 R15: 0000000000000001
[10162288.039363] FS:  00007f5a92131700(0000) GS:ffff88a5dfc00000(0000) knlGS:0000000000000000
[10162288.039364] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10162288.039366] CR2: 000055e5485cd698 CR3: 000000205d9f8000 CR4: 00000000000207e0
[10162288.039367] Call Trace:
[10162288.039373]  [<ffffffff93f80d92>] __schedule+0x402/0x840
[10162288.039376]  [<ffffffff93f811f9>] schedule+0x29/0x70
[10162288.039384]  [<ffffffff93912306>] futex_wait_queue_me+0xc6/0x130
[10162288.039387]  [<ffffffff939130ab>] futex_wait+0x17b/0x280
[10162288.039390]  [<ffffffff938d7959>] ? ttwu_do_wakeup+0x19/0xe0
[10162288.039396]  [<ffffffff938c9fe0>] ? hrtimer_get_res+0x50/0x50
[10162288.039399]  [<ffffffff939122e4>] ? futex_wait_queue_me+0xa4/0x130
[10162288.039402]  [<ffffffff93914df6>] do_futex+0x106/0x5a0
[10162288.039406]  [<ffffffff93915310>] SyS_futex+0x80/0x190
[10162288.039411]  [<ffffffff93f8dede>] system_call_fastpath+0x25/0x2a
[10162288.039414]  [<ffffffff93f8de21>] ? system_call_after_swapgs+0xae/0x146
[10162288.039440] Code: 8b 36 66 66 66 66 90 65 48 8b 34 25 80 0e 01 00 66 66 66 66 90 41 c7 45 28 00 00 00 00 48 89 df c6 07 00 66 66 66 90 fb 66 66 90 <66> 66 90 65 48 8b 04 25 80 0e 01 00 48 8b 98 78 01 00 00 48 85 
[10162288.039442] Kernel panic - not syncing: Hard LOCKUP
[10162288.039445] CPU: 32 PID: 4476 Comm: tp_osd_tp Kdump: loaded Tainted: G          I    ------------ T 3.10.0-1062.21.1.el7.x86_64 #1
[10162288.039446] Hardware name: IBM System x3850 X5 -[7143SPM]-/Node 1, Processor Card, BIOS -[G0E173BUS-1.73]- 01/21/2012
[10162288.039446] Call Trace:
[10162288.039454]  <NMI>  [<ffffffff93f7b416>] dump_stack+0x19/0x1b
[10162288.039458]  [<ffffffff93f74a0b>] panic+0xe8/0x21f
[10162288.039463]  [<ffffffff9382f958>] ? show_regs+0x58/0x290
[10162288.039470]  [<ffffffff9389b88f>] nmi_panic+0x3f/0x40
[10162288.039475]  [<ffffffff9394eda1>] watchdog_overflow_callback+0x121/0x140
[10162288.039480]  [<ffffffff939a8497>] __perf_event_overflow+0x57/0x100
[10162288.039485]  [<ffffffff939b1c34>] perf_event_overflow+0x14/0x20
[10162288.039489]  [<ffffffff9380ac70>] handle_pmi_common+0x1a0/0x250
[10162288.039493]  [<ffffffff9380af4f>] intel_pmu_handle_irq+0xcf/0x1d0
[10162288.039496]  [<ffffffff93f84031>] perf_event_nmi_handler+0x31/0x50
[10162288.039499]  [<ffffffff93f8593c>] nmi_handle.isra.0+0x8c/0x150
[10162288.039501]  [<ffffffff93f841bb>] ? save_paranoid+0xfb/0x140
[10162288.039504]  [<ffffffff93f85c18>] do_nmi+0x218/0x460
[10162288.039506]  [<ffffffff93f841af>] ? save_paranoid+0xef/0x140
[10162288.039509]  [<ffffffff93f84d9c>] end_repeat_nmi+0x1e/0x81
[10162288.039512]  [<ffffffff939176c6>] ? native_queued_spin_lock_slowpath+0x126/0x200
[10162288.039514]  [<ffffffff939176c6>] ? native_queued_spin_lock_slowpath+0x126/0x200
[10162288.039517]  [<ffffffff939176c6>] ? native_queued_spin_lock_slowpath+0x126/0x200
[10162288.039521]  <EOE>  <IRQ>  [<ffffffff93f754ee>] queued_spin_lock_slowpath+0xb/0xf
[10162288.039524]  [<ffffffff93f83ba7>] _raw_spin_lock_irqsave+0x37/0x40
[10162288.039529]  [<ffffffff939c7dcd>] get_page_from_freelist+0x69d/0xaa0
[10162288.039533]  [<ffffffff93ebb58f>] ? tcp_send_ack+0x11f/0x170
[10162288.039536]  [<ffffffff939c8336>] __alloc_pages_nodemask+0x166/0x450
[10162288.039541]  [<ffffffff93a16c28>] alloc_pages_current+0x98/0x110
[10162288.039546]  [<ffffffff93a24f03>] new_slab+0x393/0x4e0
[10162288.039548]  [<ffffffff93a2541c>] ___slab_alloc+0x3cc/0x520
[10162288.039561]  [<ffffffffc03ecf0d>] ? bnx2_poll_work+0x9ad/0x1350 [bnx2]
[10162288.039564]  [<ffffffff93e4fbd9>] ? __netif_receive_skb_core+0x729/0xa10
[10162288.039571]  [<ffffffffc03ecf0d>] ? bnx2_poll_work+0x9ad/0x1350 [bnx2]
[10162288.039573]  [<ffffffff93f77dbc>] __slab_alloc+0x40/0x5c
[10162288.039575]  [<ffffffff93a26120>] __kmalloc+0x1c0/0x230
[10162288.039581]  [<ffffffffc03ecf0d>] bnx2_poll_work+0x9ad/0x1350 [bnx2]
[10162288.039588]  [<ffffffffc03ed8ea>] bnx2_poll_msix+0x3a/0xc0 [bnx2]
[10162288.039591]  [<ffffffff93e5057f>] net_rx_action+0x26f/0x390
[10162288.039597]  [<ffffffff938a54b5>] __do_softirq+0xf5/0x280
[10162288.039601]  [<ffffffff93f9142c>] call_softirq+0x1c/0x30
[10162288.039604]  [<ffffffff9382f715>] do_softirq+0x65/0xa0
[10162288.039606]  [<ffffffff938a5835>] irq_exit+0x105/0x110
[10162289.039990]  [<ffffffff93f929d8>] smp_apic_timer_interrupt+0x48/0x60
[10162289.039994]  [<ffffffff93f8eefa>] apic_timer_interrupt+0x16a/0x170
[10162289.039995] NMI watchdog: Watchdog detected hard LOCKUP on cpu 62
[10162289.039998]  <EOI>  [<ffffffff938d51b4>] ? finish_task_switch+0x54/0x1c0
[10162289.040001]  [<ffffffff93f80d92>] __schedule+0x402/0x840
[10162289.040003]  [<ffffffff93f811f9>] schedule+0x29/0x70
[10162289.040006]  [<ffffffff93912306>] futex_wait_queue_me+0xc6/0x130
[10162289.040009]  [<ffffffff939130ab>] futex_wait+0x17b/0x280
[10162289.040011]  [<ffffffff938d7959>] ? ttwu_do_wakeup+0x19/0xe0
[10162289.040014]  [<ffffffff938c9fe0>] ? hrtimer_get_res+0x50/0x50
[10162289.040017]  [<ffffffff939122e4>] ? futex_wait_queue_me+0xa4/0x130
[10162289.040020]  [<ffffffff93914df6>] do_futex+0x106/0x5a0
[10162289.040023]  [<ffffffff93915310>] SyS_futex+0x80/0x190
[10162289.040026]  [<ffffffff93f8dede>] system_call_fastpath+0x25/0x2a
[10162289.040029]  [<ffffffff93f8de21>] ? system_call_after_swapgs+0xae/0x146
[10162290.041036] Modules linked in: target_core_pscsi target_core_file target_core_iblock binfmt_misc tcp_diag inet_diag target_core_user uio rpcrdma ib_isert sunrpc iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm vfat fat intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper ocrdma(T) cdc_ether iTCO_wdt ib_core gpio_ich iTCO_vendor_support pcspkr cryptd usbnet lpc_ich mii sg ioatdma i2c_i801 i7core_edac ipmi_si ipmi_devintf pcc_cpufreq ipmi_msghandler acpi_cpufreq ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea ata_piix sysfillrect
[10162290.041050]  sysimgblt igb fb_sys_fops ptp ttm crct10dif_pclmul drm crct10dif_common pps_core crc32c_intel dca libata serio_raw megaraid_sas be2net bnx2 drm_panel_orientation_quirks i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod
[10162290.041054] CPU: 62 PID: 4193 Comm: msgr-worker-0 Kdump: loaded Tainted: G          I    ------------ T 3.10.0-1062.21.1.el7.x86_64 #1
[10162290.041055] Hardware name: IBM System x3850 X5 -[7143SPM]-/Node 1, Processor Card, BIOS -[G0E173BUS-1.73]- 01/21/2012
[10162290.041057] task: ffff88adcf270000 ti: ffff88adca894000 task.ti: ffff88adca894000
[10162290.041064] RIP: 0010:[<ffffffff938d51b4>]  [<ffffffff938d51b4>] finish_task_switch+0x54/0x1c0
[10162290.041065] RSP: 0018:ffff88adca897a98  EFLAGS: 00000282
[10162290.041066] RAX: ffff88adbd8341c0 RBX: ffff88adcf270068 RCX: 00000000c0000100
[10162290.041068] RDX: ffff88adca895fd8 RSI: ffff88adcf270000 RDI: ffff88bddf59ac80
[10162290.041069] RBP: ffff88adca897ab8 R08: ffff88adca894000 R09: 0000000000000001
[10162290.041070] R10: 0000000000000001 R11: ffff88bddb78ba00 R12: 0000000000000292
[10162290.041071] R13: ffff88adca897a48 R14: ffff88adca897a00 R15: ffff88adcf270068
[10162290.041073] FS:  00007fdbe5d7e700(0000) GS:ffff88bddf580000(0000) knlGS:0000000000000000
[10162290.041074] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10162290.041076] CR2: 00007f8401bee528 CR3: 000000105dd34000 CR4: 00000000000207e0
[10162290.041077] Call Trace:
[10162290.041082]  [<ffffffff93f80d92>] __schedule+0x402/0x840
[10162290.041090]  [<ffffffff93a3afd9>] ? __mem_cgroup_uncharge_common+0x49/0x2f0
[10162290.041093]  [<ffffffff938d7c36>] __cond_resched+0x26/0x30
[10162290.041094]  [<ffffffff93f8149a>] _cond_resched+0x3a/0x50
[10162290.041096]  [<ffffffff93f804f2>] down_read+0x12/0x40
[10162290.041100]  [<ffffffff93a00bd0>] rmap_walk+0x300/0x360
[10162290.041105]  [<ffffffff93a2c1a8>] ? migrate_page_copy+0x128/0x380
[10162290.041107]  [<ffffffff93a2a466>] remove_migration_ptes+0x46/0x70
[10162290.041110]  [<ffffffff93a2b280>] ? migrate_vma_collect_pmd+0x610/0x610
[10162290.041112]  [<ffffffff93a2c5bb>] move_to_new_page+0x13b/0x250
[10162290.041114]  [<ffffffff93a00e9b>] ? try_to_unmap+0x8b/0xe0
[10162290.041116]  [<ffffffff939ff130>] ? page_remove_rmap+0x160/0x160
[10162290.041118]  [<ffffffff93a2defd>] migrate_pages+0x6dd/0x7f0
[10162290.041121]  [<ffffffff93a2a830>] ? migrate_vma_collect.constprop.52+0xe0/0xe0
[10162290.041123]  [<ffffffff93a2e97c>] migrate_misplaced_page+0xcc/0x100
[10162290.041128]  [<ffffffff939f12ed>] do_numa_page+0x19d/0x250
[10162290.041130]  [<ffffffff939f3c5b>] handle_mm_fault+0xadb/0xfb0
[10162290.041136]  [<ffffffff93e2e731>] ? sock_aio_read+0x21/0x30
[10162290.041144]  [<ffffffff93a49e93>] ? do_sync_read+0x93/0xe0
[10162290.041147]  [<ffffffff93f88653>] __do_page_fault+0x213/0x500
[10162290.041150]  [<ffffffff93f88975>] do_page_fault+0x35/0x90
[10162290.041152]  [<ffffffff93f84ac9>] ? error_swapgs+0xaa/0xc0
[10162290.041154]  [<ffffffff93f84778>] page_fault+0x28/0x30
[10162290.041175] Code: 8b 36 66 66 66 66 90 65 48 8b 34 25 80 0e 01 00 66 66 66 66 90 41 c7 45 28 00 00 00 00 48 89 df c6 07 00 66 66 66 90 fb 66 66 90 <66> 66 90 65 48 8b 04 25 80 0e 01 00 48 8b 98 78 01 00 00 48 85 
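
For triage, the sequence in the log above (machine check events, a CMCI storm, then soft and hard lockups) can be extracted from a saved copy of the kernel log with a short script. The following is a minimal sketch and not part of the original solution: the function name summarize_dmesg and the keys of the returned dictionary are hypothetical, and the patterns match only the message formats that appear in this log.

```python
import re

# Leading "[seconds.micros]" timestamp that the kernel prints before each message.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")
HARD_LOCKUP_RE = re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)")
SOFT_LOCKUP_RE = re.compile(r"BUG: soft lockup - CPU#(\d+)")


def summarize_dmesg(lines):
    """Summarize MCE, CMCI storm, and lockup messages in kernel log lines.

    Hypothetical helper: it only counts and timestamps messages; it does
    not decode the machine check records themselves.
    """
    summary = {
        "mce_events": 0,          # "Machine check events logged" lines
        "first_mce_at": None,     # timestamp of the first such line
        "cmci_storm_at": None,    # timestamp of the first CMCI storm message
        "soft_lockup_cpus": [],   # CPUs reported in soft lockup messages
        "hard_lockup_cpus": [],   # CPUs reported in hard LOCKUP messages
    }
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # not a timestamped kernel message
        timestamp, message = float(match.group(1)), match.group(2)

        if "Machine check events logged" in message:
            summary["mce_events"] += 1
            if summary["first_mce_at"] is None:
                summary["first_mce_at"] = timestamp
        elif "CMCI storm detected" in message and summary["cmci_storm_at"] is None:
            summary["cmci_storm_at"] = timestamp

        hard = HARD_LOCKUP_RE.search(message)
        if hard:
            summary["hard_lockup_cpus"].append(int(hard.group(1)))
        soft = SOFT_LOCKUP_RE.search(message)
        if soft:
            summary["soft_lockup_cpus"].append(int(soft.group(1)))
    return summary
```

The function accepts any iterable of lines, so an open file object for a saved log can be passed directly. A steadily growing mce_events count well before cmci_storm_at would be consistent with the pattern described in the Issue section, but the script itself draws no conclusion about the root cause.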

Environment

  • Red Hat Enterprise Linux 7.7 (kernel-3.10.0-1062.21.1.el7)
  • IBM System x3850 X5
