"Kernel panic - not syncing: Fatal Machine check" or Machine Check Exception (MCE) in /var/log/messages


Issue

  • The system hangs or panics, and Machine Check Exception (MCE) messages appear in /var/log/messages.
  • The system is hung or not responding, and the messages captured on the netdump server include "Kernel panic - not syncing: Machine check".
  • "Kernel panic - not syncing: Uncorrected machine check"
  • The system reported a hardware error, such as a faulty DIMM or a temperature warning, before hanging.
  • System rebooted due to Machine Check Exception and a vmcore was collected.
Kernel panic - not syncing: Fatal Machine check
Pid: 0, comm: swapper Tainted: G   M       ----------------   2.6.32-220.el6.x86_64 #1
Call Trace:
 <#MC>  [<ffffffff814ec341>] ? panic+0x78/0x143
 [<ffffffff81021d7f>] ? mce_panic+0x21f/0x240
 [<ffffffff81023638>] ? do_machine_check+0xa18/0xa60
 [<ffffffff812c4a41>] ? intel_idle+0xb1/0x170
 [<ffffffff814ef86c>] ? machine_check+0x1c/0x30
 [<ffffffff812c4a41>] ? intel_idle+0xb1/0x170
 <<EOE>>  [<ffffffff81095d98>] ? hrtimer_start+0x18/0x20
 [<ffffffff813f9f67>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
 [<ffffffff814e5f43>] ? start_secondary+0x202/0x245
  • /var/log/messages or /var/log/mcelog contains messages such as the following:
kernel: Machine check events logged
mcelog: MCE 0
mcelog: HARDWARE ERROR. This is *NOT* a software problem!
mcelog: Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c
mcelog: CPU 0 BANK 8 TSC a66b05434fcf4 [at 2668 Mhz 12 days 16:48:42 uptime (unreliable)]
mcelog: MISC 5522140800080282 ADDR 4f83b8dc0
mcelog: MCG status:
mcelog: MCi status:
mcelog: MCi_MISC register valid
mcelog: MCi_ADDR register valid
mcelog: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
mcelog: Transaction: Memory read error
mcelog: STATUS 8c0000400001009f MCGSTATUS 0
kernel: BUG: soft lockup - CPU#10 stuck for 10s! [mcelog:6356]
  • Other similar errors:
Hardware event. This is not a software error.
Corrected error
Transaction: Memory scrubbing error
Memory ECC error occurred during scrub
Memory corrected error count (CORE_ERR_CNT): 1
Memory DIMM ID of error: 1
Memory channel ID of error: 2
Hardware event. This is not a software error.
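
The signatures listed so far can be searched for mechanically. The following is a minimal Python sketch, not a Red Hat tool: the function name find_mce_lines and the SIGNATURES tuple are hypothetical, and the strings are copied from the messages quoted above. It assumes the log has already been opened and is passed in as an iterable of lines.

```python
# Hypothetical helper: flag log lines that carry one of the MCE
# signatures quoted in this article. The strings are taken verbatim
# from the excerpts above; the list is illustrative, not exhaustive.
SIGNATURES = (
    "Kernel panic - not syncing: Fatal Machine check",
    "Kernel panic - not syncing: Machine check",
    "Kernel panic - not syncing: Uncorrected machine check",
    "Machine check events logged",
    "HARDWARE ERROR. This is *NOT* a software problem!",
    "Hardware event. This is not a software error.",
)


def find_mce_lines(lines):
    """Return (line_number, signature, text) for each line that matches."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for signature in SIGNATURES:
            if signature in line:
                hits.append((number, signature, line.rstrip("\n")))
                break  # report each line once, under the first match
    return hits
```

For example, passing an open file object for /var/log/messages or /var/log/mcelog to find_mce_lines returns the matching lines together with their line numbers.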

Sometimes /var/log/messages also contains call traces:

Jan  8 08:30:27 Hostname kernel: Pid: 30350, comm: rgmanager Tainted: G        W  ---------------    2.6.32-358.el6.x86_64 #1 Dell Inc. PowerEdge R910/0NCWG9
Jan  8 08:30:27 Hostname kernel: RIP: 0010:[<ffffffff8150ffce>]  [<ffffffff8150ffce>] _spin_lock+0x1e/0x30
Jan  8 08:30:27 Hostname kernel: RSP: 0018:ffff8820c05cdd10  EFLAGS: 00000283
Jan  8 08:30:27 Hostname kernel: RAX: 0000000000003964 RBX: ffff8820c05cdd10 RCX: 0000000000000000
Jan  8 08:30:27 Hostname kernel: RDX: 000000000000395f RSI: 000000000000001b RDI: ffffffff81e227e8
Jan  8 08:30:27 Hostname kernel: RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 0000000000000000
Jan  8 08:30:27 Hostname kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8810685602d8
Jan  8 08:30:27 Hostname kernel: R13: 0000000000000000 R14: ffff883080010e40 R15: 0000000000000000
Jan  8 08:30:27 Hostname kernel: FS:  00007f3e81a20700(0000) GS:ffff8830b8880000(0000) knlGS:0000000000000000
Jan  8 08:30:27 Hostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan  8 08:30:27 Hostname kernel: CR2: 00000000027477b0 CR3: 00000010671a6000 CR4: 00000000000007e0
Jan  8 08:30:27 Hostname kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan  8 08:30:27 Hostname kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan  8 08:30:27 Hostname kernel: Process rgmanager (pid: 30350, threadinfo ffff8820c05cc000, task ffff8820be965540)
Jan  8 08:30:27 Hostname kernel: Stack:
Jan  8 08:30:27 Hostname kernel: ffff8820c05cdd40 ffffffff8104b8d0 ffff8820c05cdd60 ffff884068122400
Jan  8 08:30:27 Hostname kernel: <d> ffff883a408fc040 ffff883a408fc040 ffff8820c05cdd60 ffffffff8106b179
Jan  8 08:30:27 Hostname kernel: <d> ffff884068122400 ffff881066d31440 ffff8820c05cdde0 ffffffff8106b879
Jan  8 08:30:27 Hostname kernel: Call Trace:
Jan  8 08:30:27 Hostname kernel: [<ffffffff8104b8d0>] ? pgd_alloc+0x50/0x130
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106b179>] ? mm_init+0x139/0x180
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106b879>] ? dup_mm+0xa9/0x520
Jan  8 08:30:27 Hostname kernel: [<ffffffff81061d03>] ? sched_autogroup_fork+0x63/0xa0
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106cb6f>] ? copy_process+0xd5f/0x1450
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
Jan  8 08:30:27 Hostname kernel: [<ffffffff8109bfb4>] ? hrtimer_nanosleep+0xc4/0x180
Jan  8 08:30:27 Hostname kernel: [<ffffffff8109ae00>] ? hrtimer_wakeup+0x0/0x30
Jan  8 08:30:27 Hostname kernel: [<ffffffff81009598>] ? sys_clone+0x28/0x30
Jan  8 08:30:27 Hostname kernel: [<ffffffff8100b393>] ? stub_clone+0x13/0x20
Jan  8 08:30:27 Hostname kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Jan  8 08:30:27 Hostname kernel: Code: 00 00 00 01 74 05 e8 b2 33 d7 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89
Jan  8 08:30:27 Hostname kernel: Call Trace:
Jan  8 08:30:27 Hostname kernel: [<ffffffff8104b8d0>] ? pgd_alloc+0x50/0x130
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106b179>] ? mm_init+0x139/0x180
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106b879>] ? dup_mm+0xa9/0x520
Jan  8 08:30:27 Hostname kernel: [<ffffffff81061d03>] ? sched_autogroup_fork+0x63/0xa0
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106cb6f>] ? copy_process+0xd5f/0x1450
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
Jan  8 08:30:27 Hostname kernel: [<ffffffff8109bfb4>] ? hrtimer_nanosleep+0xc4/0x180
Jan  8 08:30:27 Hostname kernel: [<ffffffff8109ae00>] ? hrtimer_wakeup+0x0/0x30
Jan  8 08:30:27 Hostname kernel: [<ffffffff81009598>] ? sys_clone+0x28/0x30
Jan  8 08:30:27 Hostname kernel: [<ffffffff8100b393>] ? stub_clone+0x13/0x20
Jan  8 08:30:27 Hostname kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Jan  8 08:30:39 Hostname kernel: BUG: soft lockup - CPU#3 stuck for 67s! [sshd:4711]
......

There may also be error records in /var/log/mcelog such as the following:

MCE 0
CPU 2 BANK 9
TIME 1388666356 Thu Jan  2 20:39:16 2014
MCG status:
MCi status:
Uncorrected error
Error enabled
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
STATUS b00000000800009f MCGSTATUS 0
MCGCAP 1000c18 APICID 80 SOCKETID 2
CPUID Vendor Intel Family 6 Model 47
Hardware event. This is not a software error.
.....
  • A cron job running /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog logs the following recurring errors:
TIME 1320670862 Mon Nov  7 14:01:02 2011
MCG status:
MCi status:
Corrected error
Error enabled
MCi_MISC register valid
MCA: BUS Level-3 Generic Generic Other-transaction Request-did-not-timeout Error
<16:2> BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS 9800004000020e0f MCGSTATUS 0
MCGCAP 1000c16 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 46
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 0
MISC 1
  • Why do so many MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR records appear in mcelog? /var/log/messages and /var/log/mcelog contain messages similar to:
TIME 1336064652 Fri May  4 01:04:12 2012
MCG status:
MCi status:
Error overflow
Corrected error
Error enabled
MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR
Transaction: Generic undefined request
STATUS d00002c0000a008f MCGSTATUS 0
MCGCAP 1000c18 APICID 40 SOCKETID 1
CPUID Vendor Intel Family 6 Model 47
Hardware event. This is not a software error.
MCE 0
CPU 8 BANK 9
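
The STATUS values in the records above encode the same information that mcelog prints as text. The following is a minimal Python sketch of that decoding, assuming the architectural IA32_MCi_STATUS layout described in the Intel Software Developer's Manual (flag bits 63 to 57, MCA error code in bits 15:0). The names decode_status, FLAGS, and MEM_TRANSACTIONS are hypothetical, only the memory-controller form of the error code is interpreted, and model-specific fields are ignored.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # error overflow
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

# Memory-controller compound error code: 0000 0000 1MMM CCCC
MEM_TRANSACTIONS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_status(status_hex):
    """Decode the flag bits and MCA error code of a STATUS hex string."""
    status = int(status_hex, 16)
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "mca_code": code}
    # Memory-controller errors have bits 15:8 clear and bit 7 set.
    if code & 0xFF80 == 0x0080:
        channel = code & 0xF
        result["transaction"] = MEM_TRANSACTIONS.get((code >> 4) & 0x7, "reserved")
        # Channel 1111 means "channel not specified".
        result["channel"] = None if channel == 0xF else channel
    return result
```

Applied to the values quoted above, decode_status("8c0000400001009f") reports the VAL, MISCV, and ADDRV flags with a memory read error on an unspecified channel, and decode_status("d00002c0000a008f") reports VAL, OVER, and EN with a generic undefined request. Both agree with the accompanying mcelog text.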

Environment

  • Red Hat Enterprise Linux 7
  • Red Hat Enterprise Linux 6
  • Red Hat Enterprise Linux 5
  • Red Hat Enterprise Linux 4
  • Red Hat Enterprise Linux 3
