What does the message "HARDWARE ERROR. This is *NOT* a software problem!" mean?


Issue

  • /var/log/messages or /var/log/mcelog contains messages such as the following:
kernel: Machine check events logged
mcelog: MCE 0
mcelog: HARDWARE ERROR. This is *NOT* a software problem!
mcelog: Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c
mcelog: CPU 0 BANK 8 TSC a66b05434fcf4 [at 2668 Mhz 12 days 16:48:42 uptime (unreliable)]
mcelog: MISC 5522140800080282 ADDR 4f83b8dc0
mcelog: MCG status:
mcelog: MCi status:
mcelog: MCi_MISC register valid
mcelog: MCi_ADDR register valid
mcelog: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
mcelog: Transaction: Memory read error
mcelog: STATUS 8c0000400001009f MCGSTATUS 0
kernel: BUG: soft lockup - CPU#10 stuck for 10s! [mcelog:6356]
  • Other similar errors:
Hardware event. This is not a software error.
Corrected error
Transaction: Memory scrubbing error
Memory ECC error occurred during scrub
Memory corrected error count (CORE_ERR_CNT): 1
Memory DIMM ID of error: 1
Memory channel ID of error: 2
Hardware event. This is not a software error.

Sometimes /var/log/messages also contains kernel traces:

Jan  8 08:30:27 Hostname kernel: Pid: 30350, comm: rgmanager Tainted: G        W  ---------------    2.6.32-358.el6.x86_64 #1 Dell Inc. PowerEdge R910/0NCWG9
Jan  8 08:30:27 Hostname kernel: RIP: 0010:[<ffffffff8150ffce>]  [<ffffffff8150ffce>] _spin_lock+0x1e/0x30
Jan  8 08:30:27 Hostname kernel: RSP: 0018:ffff8820c05cdd10  EFLAGS: 00000283
Jan  8 08:30:27 Hostname kernel: RAX: 0000000000003964 RBX: ffff8820c05cdd10 RCX: 0000000000000000
Jan  8 08:30:27 Hostname kernel: RDX: 000000000000395f RSI: 000000000000001b RDI: ffffffff81e227e8
Jan  8 08:30:27 Hostname kernel: RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 0000000000000000
Jan  8 08:30:27 Hostname kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8810685602d8
Jan  8 08:30:27 Hostname kernel: R13: 0000000000000000 R14: ffff883080010e40 R15: 0000000000000000
Jan  8 08:30:27 Hostname kernel: FS:  00007f3e81a20700(0000) GS:ffff8830b8880000(0000) knlGS:0000000000000000
Jan  8 08:30:27 Hostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan  8 08:30:27 Hostname kernel: CR2: 00000000027477b0 CR3: 00000010671a6000 CR4: 00000000000007e0
Jan  8 08:30:27 Hostname kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan  8 08:30:27 Hostname kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan  8 08:30:27 Hostname kernel: Process rgmanager (pid: 30350, threadinfo ffff8820c05cc000, task ffff8820be965540)
Jan  8 08:30:27 Hostname kernel: Stack:
Jan  8 08:30:27 Hostname kernel: ffff8820c05cdd40 ffffffff8104b8d0 ffff8820c05cdd60 ffff884068122400
Jan  8 08:30:27 Hostname kernel: <d> ffff883a408fc040 ffff883a408fc040 ffff8820c05cdd60 ffffffff8106b179
Jan  8 08:30:27 Hostname kernel: <d> ffff884068122400 ffff881066d31440 ffff8820c05cdde0 ffffffff8106b879
Jan  8 08:30:27 Hostname kernel: Call Trace:
Jan  8 08:30:27 Hostname kernel: [<ffffffff8104b8d0>] ? pgd_alloc+0x50/0x130
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106b179>] ? mm_init+0x139/0x180
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106b879>] ? dup_mm+0xa9/0x520
Jan  8 08:30:27 Hostname kernel: [<ffffffff81061d03>] ? sched_autogroup_fork+0x63/0xa0
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106cb6f>] ? copy_process+0xd5f/0x1450
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
Jan  8 08:30:27 Hostname kernel: [<ffffffff8109bfb4>] ? hrtimer_nanosleep+0xc4/0x180
Jan  8 08:30:27 Hostname kernel: [<ffffffff8109ae00>] ? hrtimer_wakeup+0x0/0x30
Jan  8 08:30:27 Hostname kernel: [<ffffffff81009598>] ? sys_clone+0x28/0x30
Jan  8 08:30:27 Hostname kernel: [<ffffffff8100b393>] ? stub_clone+0x13/0x20
Jan  8 08:30:27 Hostname kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Jan  8 08:30:27 Hostname kernel: Code: 00 00 00 01 74 05 e8 b2 33 d7 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89
Jan  8 08:30:27 Hostname kernel: Call Trace:
Jan  8 08:30:27 Hostname kernel: [<ffffffff8104b8d0>] ? pgd_alloc+0x50/0x130
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106b179>] ? mm_init+0x139/0x180
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106b879>] ? dup_mm+0xa9/0x520
Jan  8 08:30:27 Hostname kernel: [<ffffffff81061d03>] ? sched_autogroup_fork+0x63/0xa0
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106cb6f>] ? copy_process+0xd5f/0x1450
Jan  8 08:30:27 Hostname kernel: [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
Jan  8 08:30:27 Hostname kernel: [<ffffffff8109bfb4>] ? hrtimer_nanosleep+0xc4/0x180
Jan  8 08:30:27 Hostname kernel: [<ffffffff8109ae00>] ? hrtimer_wakeup+0x0/0x30
Jan  8 08:30:27 Hostname kernel: [<ffffffff81009598>] ? sys_clone+0x28/0x30
Jan  8 08:30:27 Hostname kernel: [<ffffffff8100b393>] ? stub_clone+0x13/0x20
Jan  8 08:30:27 Hostname kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Jan  8 08:30:39 Hostname kernel: BUG: soft lockup - CPU#3 stuck for 67s! [sshd:4711]
......

There may also be error records in /var/log/mcelog, such as the following:

MCE 0
CPU 2 BANK 9
TIME 1388666356 Thu Jan  2 20:39:16 2014
MCG status:
MCi status:
Uncorrected error
Error enabled
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
STATUS b00000000800009f MCGSTATUS 0
MCGCAP 1000c18 APICID 80 SOCKETID 2
CPUID Vendor Intel Family 6 Model 47
Hardware event. This is not a software error.
.....
  • A cron job running /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog logs the following recurring errors:
TIME 1320670862 Mon Nov  7 14:01:02 2011
MCG status:
MCi status:
Corrected error
Error enabled
MCi_MISC register valid
MCA: BUS Level-3 Generic Generic Other-transaction Request-did-not-timeout Error
<16:2> BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS 9800004000020e0f MCGSTATUS 0
MCGCAP 1000c16 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 46
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 0
MISC 1
  • Why does mcelog report so many MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR events? /var/log/messages and /var/log/mcelog contain messages similar to:
TIME 1336064652 Fri May  4 01:04:12 2012
MCG status:
MCi status:
Error overflow
Corrected error
Error enabled
MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR
Transaction: Generic undefined request
STATUS d00002c0000a008f MCGSTATUS 0
MCGCAP 1000c18 APICID 40 SOCKETID 1
CPUID Vendor Intel Family 6 Model 47
Hardware event. This is not a software error.
MCE 0
CPU 8 BANK 9
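
The STATUS value in each record above is the raw machine-check bank status register, and the lines mcelog prints before it ("Corrected error", "Error enabled", "MCi_ADDR register valid", and so on) are decoded from its bits. The following is a minimal Python sketch of that decoding, not mcelog's own implementation. It assumes the architectural bit layout documented in the Intel Software Developer's Manual (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, and the memory-controller error code pattern 0000 0000 1MMM CCCC in the low 16 bits). The names decode_status, FLAGS, and MEM_OPS are hypothetical and chosen only for this illustration.

```python
# Sketch only: decodes the architectural part of an MCi_STATUS value as
# printed by mcelog. Bit positions follow the Intel SDM; the function and
# table names are hypothetical, not part of mcelog.

# Flag bits in the upper part of the register.
FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # error overflow: an earlier error was overwritten
    61: "UC",     # uncorrected error (clear means corrected)
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

# MMM field of a memory-controller error code (0000 0000 1MMM CCCC).
MEM_OPS = {0: "GEN", 1: "RD", 2: "WR", 3: "AC", 4: "MS"}


def decode_status(status_hex):
    """Return the set flags and, for memory-controller errors, the
    operation type and channel encoded in the low 16 bits."""
    status = int(status_hex, 16)
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF  # MCA error code
    result = {"flags": flags, "mca_code": code}
    if (code & 0xFF80) == 0x0080:  # memory-controller error pattern
        channel = code & 0xF
        result["memory_op"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        # 0xF means the channel was not specified
        result["channel"] = None if channel == 0xF else channel
    return result


if __name__ == "__main__":
    # STATUS values copied from the records in this article.
    for value in ("8c0000400001009f", "b00000000800009f", "d00002c0000a008f"):
        print(value, decode_status(value))
```

For example, b00000000800009f has UC and EN set with memory operation RD and no channel specified, which corresponds to the "Uncorrected error", "Error enabled", and "MEMORY CONTROLLER RD_CHANNELunspecified_ERR" lines in the record above. A UC bit that is clear, as in d00002c0000a008f, corresponds to "Corrected error".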

Environment

  • Red Hat Enterprise Linux
