What does the message "HARDWARE ERROR. This is *NOT* a software problem!" mean?
Issue
- /var/log/messages or /var/log/mcelog contains messages similar to the following:
kernel: Machine check events logged
mcelog: MCE 0
mcelog: HARDWARE ERROR. This is *NOT* a software problem!
mcelog: Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c
mcelog: CPU 0 BANK 8 TSC a66b05434fcf4 [at 2668 Mhz 12 days 16:48:42 uptime (unreliable)]
mcelog: MISC 5522140800080282 ADDR 4f83b8dc0
mcelog: MCG status:
mcelog: MCi status:
mcelog: MCi_MISC register valid
mcelog: MCi_ADDR register valid
mcelog: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
mcelog: Transaction: Memory read error
mcelog: STATUS 8c0000400001009f MCGSTATUS 0
kernel: BUG: soft lockup - CPU#10 stuck for 10s! [mcelog:6356]
- Other similar errors:
Hardware event. This is not a software error.
Corrected error
Transaction: Memory scrubbing error
Memory ECC error occurred during scrub
Memory corrected error count (CORE_ERR_CNT): 1
Memory DIMM ID of error: 1
Memory channel ID of error: 2
Hardware event. This is not a software error.
Sometimes /var/log/messages also contains kernel traces such as:
Jan 8 08:30:27 Hostname kernel: Pid: 30350, comm: rgmanager Tainted: G W --------------- 2.6.32-358.el6.x86_64 #1 Dell Inc. PowerEdge R910/0NCWG9
Jan 8 08:30:27 Hostname kernel: RIP: 0010:[<ffffffff8150ffce>] [<ffffffff8150ffce>] _spin_lock+0x1e/0x30
Jan 8 08:30:27 Hostname kernel: RSP: 0018:ffff8820c05cdd10 EFLAGS: 00000283
Jan 8 08:30:27 Hostname kernel: RAX: 0000000000003964 RBX: ffff8820c05cdd10 RCX: 0000000000000000
Jan 8 08:30:27 Hostname kernel: RDX: 000000000000395f RSI: 000000000000001b RDI: ffffffff81e227e8
Jan 8 08:30:27 Hostname kernel: RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 0000000000000000
Jan 8 08:30:27 Hostname kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8810685602d8
Jan 8 08:30:27 Hostname kernel: R13: 0000000000000000 R14: ffff883080010e40 R15: 0000000000000000
Jan 8 08:30:27 Hostname kernel: FS: 00007f3e81a20700(0000) GS:ffff8830b8880000(0000) knlGS:0000000000000000
Jan 8 08:30:27 Hostname kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 8 08:30:27 Hostname kernel: CR2: 00000000027477b0 CR3: 00000010671a6000 CR4: 00000000000007e0
Jan 8 08:30:27 Hostname kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 8 08:30:27 Hostname kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 8 08:30:27 Hostname kernel: Process rgmanager (pid: 30350, threadinfo ffff8820c05cc000, task ffff8820be965540)
Jan 8 08:30:27 Hostname kernel: Stack:
Jan 8 08:30:27 Hostname kernel: ffff8820c05cdd40 ffffffff8104b8d0 ffff8820c05cdd60 ffff884068122400
Jan 8 08:30:27 Hostname kernel: <d> ffff883a408fc040 ffff883a408fc040 ffff8820c05cdd60 ffffffff8106b179
Jan 8 08:30:27 Hostname kernel: <d> ffff884068122400 ffff881066d31440 ffff8820c05cdde0 ffffffff8106b879
Jan 8 08:30:27 Hostname kernel: Call Trace:
Jan 8 08:30:27 Hostname kernel: [<ffffffff8104b8d0>] ? pgd_alloc+0x50/0x130
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106b179>] ? mm_init+0x139/0x180
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106b879>] ? dup_mm+0xa9/0x520
Jan 8 08:30:27 Hostname kernel: [<ffffffff81061d03>] ? sched_autogroup_fork+0x63/0xa0
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106cb6f>] ? copy_process+0xd5f/0x1450
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
Jan 8 08:30:27 Hostname kernel: [<ffffffff8109bfb4>] ? hrtimer_nanosleep+0xc4/0x180
Jan 8 08:30:27 Hostname kernel: [<ffffffff8109ae00>] ? hrtimer_wakeup+0x0/0x30
Jan 8 08:30:27 Hostname kernel: [<ffffffff81009598>] ? sys_clone+0x28/0x30
Jan 8 08:30:27 Hostname kernel: [<ffffffff8100b393>] ? stub_clone+0x13/0x20
Jan 8 08:30:27 Hostname kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Jan 8 08:30:27 Hostname kernel: Code: 00 00 00 01 74 05 e8 b2 33 d7 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89
Jan 8 08:30:27 Hostname kernel: Call Trace:
Jan 8 08:30:27 Hostname kernel: [<ffffffff8104b8d0>] ? pgd_alloc+0x50/0x130
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106b179>] ? mm_init+0x139/0x180
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106b879>] ? dup_mm+0xa9/0x520
Jan 8 08:30:27 Hostname kernel: [<ffffffff81061d03>] ? sched_autogroup_fork+0x63/0xa0
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106cb6f>] ? copy_process+0xd5f/0x1450
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
Jan 8 08:30:27 Hostname kernel: [<ffffffff8109bfb4>] ? hrtimer_nanosleep+0xc4/0x180
Jan 8 08:30:27 Hostname kernel: [<ffffffff8109ae00>] ? hrtimer_wakeup+0x0/0x30
Jan 8 08:30:27 Hostname kernel: [<ffffffff81009598>] ? sys_clone+0x28/0x30
Jan 8 08:30:27 Hostname kernel: [<ffffffff8100b393>] ? stub_clone+0x13/0x20
Jan 8 08:30:27 Hostname kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Jan 8 08:30:39 Hostname kernel: BUG: soft lockup - CPU#3 stuck for 67s! [sshd:4711]
......
There may also be error records in /var/mcelog such as the following:
MCE 0
CPU 2 BANK 9
TIME 1388666356 Thu Jan 2 20:39:16 2014
MCG status:
MCi status:
Uncorrected error
Error enabled
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
STATUS b00000000800009f MCGSTATUS 0
MCGCAP 1000c18 APICID 80 SOCKETID 2
CPUID Vendor Intel Family 6 Model 47
Hardware event. This is not a software error.
.....
- A cronjob running
/usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
results in the following recurring errors:
TIME 1320670862 Mon Nov 7 14:01:02 2011
MCG status:
MCi status:
Corrected error
Error enabled
MCi_MISC register valid
MCA: BUS Level-3 Generic Generic Other-transaction Request-did-not-timeout Error
<16:2> BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS 9800004000020e0f MCGSTATUS 0
MCGCAP 1000c16 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 46
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 0
MISC 1
- Why do we see a lot of MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR in mcelog?
/var/log/messages and /var/log/mcelog contain messages similar to:
TIME 1336064652 Fri May 4 01:04:12 2012
MCG status:
MCi status:
Error overflow
Corrected error
Error enabled
MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR
Transaction: Generic undefined request
STATUS d00002c0000a008f MCGSTATUS 0
MCGCAP 1000c18 APICID 40 SOCKETID 1
CPUID Vendor Intel Family 6 Model 47
Hardware event. This is not a software error.
MCE 0
CPU 8 BANK 9
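The STATUS values in the records above can be cross-checked against the decoded lines that mcelog prints (for example "Corrected error" or "MCi_ADDR register valid"). The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout and the memory controller error code format described in the Intel Software Developer's Manual. The function names decode_status and decode_memory_controller are hypothetical helpers, not part of mcelog, and model-specific fields are returned undecoded.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# MMM field of a memory controller error code (0000 0000 1MMM CCCC).
MEM_TRANSACTIONS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_status(status_hex):
    """Split a STATUS value from an mcelog record into its generic fields."""
    status = int(status_hex, 16)
    return {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "corrected": not ((status >> 61) & 1),
        "mca_code": status & 0xFFFF,                     # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16, not decoded
    }


def decode_memory_controller(mca_code):
    """Return (transaction, channel) for a memory controller code, else None.

    channel is None when the CCCC field is 1111 (channel not specified).
    """
    if (mca_code & 0xFF80) != 0x0080:
        return None  # not in the memory controller error format
    transaction = MEM_TRANSACTIONS.get((mca_code >> 4) & 0x7, "reserved")
    channel = mca_code & 0xF
    return transaction, (None if channel == 0xF else channel)


if __name__ == "__main__":
    # STATUS values copied from the records shown in this article.
    for value in ("8c0000400001009f", "b00000000800009f", "d00002c0000a008f"):
        fields = decode_status(value)
        print(value, fields["flags"], decode_memory_controller(fields["mca_code"]))
```

For the first record in this article (STATUS 8c0000400001009f) the sketch reports the MISCV and ADDRV flags and a memory read error with no channel specified, which agrees with the "MCi_MISC register valid", "MCi_ADDR register valid" and "RD_CHANNELunspecified_ERR" lines that mcelog printed for that event.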
Environment
- Red Hat Enterprise Linux