Frequent Freeze with MCE errors observed on Lenovo M91p 4524CTO running Red Hat Enterprise Linux 6.4 and later releases

Solution Verified - Updated -

Issue

  • Frequent hard freeze on multiple Lenovo M91p 4524CTO workstations.
  • The freeze appears to occur only when a user is logged in but away from their desk (system idle, screensaver on or monitor in powersave).
  • The issue is observed in runlevel 3 as well as runlevel 5. It is not reproduced when the "nomodeset" parameter is added to the kernel line.
  • The freeze is complete: the screen and keyboard go dead (the Num Lock light does not respond, SysRq sequences have no effect) and no network traffic is accepted or returned (ping, ssh, etc.).
  • No information about the freeze is recorded in the local logs or in those sent to the loghost. The only recovery method is to hold the power button down for at least 4 seconds and then press it again (or remove and restore power).
  • A vmcore cannot be collected because of the complete hang.

  • The following MCE logs were collected from the serial console.

kernel: Disabling lock debugging due to kernel taint
kernel: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 4: b200000011000402
kernel: [Hardware Error]: RIP !INEXACT! 10:<ffffffff812d3a71> {intel_idle+0xb1/0x170}
kernel: [Hardware Error]: TSC 1a61fb503eb7c 
kernel: [Hardware Error]: PROCESSOR 0:206a7 TIME 1378866642 SOCKET 0 APIC 1
kernel: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 4: b200000011000402
kernel: [Hardware Error]: RIP !INEXACT! 33:<0000003a9dd9bbb0> 
kernel: [Hardware Error]: TSC 1a61fb503eb62 
kernel: [Hardware Error]: PROCESSOR 0:206a7 TIME 1378866642 SOCKET 0 APIC 0
kernel: [Hardware Error]: Some CPUs didn't answer in synchronization
kernel: [Hardware Error]: Machine check: Processor context corrupt
kernel: Kernel panic - not syncing: Fatal machine check on current CPU

kernel: Disabling lock debugging due to kernel taint
kernel: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 4: b200000011000402
kernel: [Hardware Error]: RIP !INEXACT! 10:<ffffffff812d3a71> {intel_idle+0xb1/0x170}
kernel: [Hardware Error]: TSC 25a05e2014bf8 
kernel: [Hardware Error]: PROCESSOR 0:206a7 TIME 1379010691 SOCKET 0 APIC 1
kernel: [Hardware Error]: Some CPUs didn't answer in synchronization
kernel: [Hardware Error]: Machine check: Processor context corrupt
kernel: Kernel panic - not syncing: Fatal machine check on current CPU
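
When several affected workstations are being compared, it can help to pull the CPU, bank and status value out of each captured console log and check whether they repeat. The following is a minimal sketch, not a supported tool: the function name extract_mce_events and the SAMPLE text are illustrative assumptions, and the regular expression only targets the "Machine Check Exception" line format shown above.

```python
import re

# Matches lines such as:
#   [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 4: b200000011000402
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception:\s*(?P<mcg>[0-9a-fA-F]+)"
    r"\s+Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)


def extract_mce_events(log_text):
    """Return a list of (cpu, bank, mcgstatus, status) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = MCE_LINE.search(line)
        if match:
            events.append((
                int(match.group("cpu")),
                int(match.group("bank")),
                int(match.group("mcg"), 16),     # MCGSTATUS is printed in hex
                int(match.group("status"), 16),  # bank status register, hex
            ))
    return events


# Hypothetical capture: the three exception lines from the logs above.
SAMPLE = """\
kernel: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 4: b200000011000402
kernel: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 4: b200000011000402
kernel: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 4: b200000011000402
"""

if __name__ == "__main__":
    for cpu, bank, mcg, status in extract_mce_events(SAMPLE):
        print("CPU %d bank %d MCGSTATUS %x STATUS %016x" % (cpu, bank, mcg, status))
```

In the logs above every event reports the same bank (4) and the same status value, which is what this kind of summary is meant to make visible.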

# mcelog --ascii
CPU 4: Machine Check Exception: 5 Bank 4: b200000011000402
RIP !INEXACT! 10:<ffffffff812d3a71> {intel_idle+0xb1/0x170}
TSC 1a61fb503eb7c
PROCESSOR 0:206a7 TIME 1378866642 SOCKET 0 APIC 1
Hardware event. This is not a software error.
CPU 4 BANK 4 TSC 1a61fb503eb7c 
RIP !INEXACT! 10:ffffffff812d3a71
TIME 1378866642 Tue Sep 10 22:30:42 2013
MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal unclassified error: 402
PCU: No error <24:11>
STATUS b200000011000402 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 42
RIP: intel_idle+0xb1/0x170}
SOCKET 0 APIC 1
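
The fields that mcelog prints can be cross-checked by hand against the raw register values. The sketch below decodes STATUS, MCGSTATUS and the processor signature (206a7) using the architectural bit layout from the Intel machine-check architecture documentation. The function names are illustrative assumptions, and the model-specific bits are only extracted, not interpreted, since their meaning depends on the CPU model.

```python
# Architectural flag bits of an IA32_MCi_STATUS register (bit number, name).
MCI_STATUS_FLAGS = [
    (63, "VAL"),    # register contains valid data
    (62, "OVER"),   # a previous error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MISC register valid
    (58, "ADDRV"),  # ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

# Flag bits of IA32_MCG_STATUS.
MCG_STATUS_FLAGS = [(0, "RIPV"), (1, "EIPV"), (2, "MCIP")]


def _set_flags(value, table):
    """Return the names of all flags in table whose bit is set in value."""
    return [name for bit, name in table if (value >> bit) & 1]


def decode_mci_status(status):
    """Split a bank status value into flags and error-code fields."""
    return {
        "flags": _set_flags(status, MCI_STATUS_FLAGS),
        "mca_code": status & 0xFFFF,            # bits 15:0
        "model_code": (status >> 16) & 0xFFFF,  # bits 31:16, model specific
    }


def decode_mcg_status(mcg):
    return _set_flags(mcg, MCG_STATUS_FLAGS)


def decode_cpuid_signature(sig):
    """Return (family, model, stepping) from a CPUID signature such as 0x206a7."""
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    family = (sig >> 8) & 0xF
    if family in (6, 15):
        model |= ((sig >> 16) & 0xF) << 4  # extended model
    if family == 15:
        family += (sig >> 20) & 0xFF       # extended family
    return family, model, stepping


if __name__ == "__main__":
    decoded = decode_mci_status(0xB200000011000402)
    print("flags:", decoded["flags"])
    print("MCA code: %x, model-specific code: %x"
          % (decoded["mca_code"], decoded["model_code"]))
    print("MCG flags:", decode_mcg_status(0x5))
    print("family %d model %d stepping %d" % decode_cpuid_signature(0x206A7))
```

For the values in this report, the set UC, EN and PCC flags correspond to the "Uncorrected error", "Error enabled" and "Processor context corrupt" lines, the MCA code 402 matches "Internal unclassified error: 402", MCGSTATUS 5 decodes to RIPV and MCIP, and the signature 206a7 decodes to Family 6 Model 42.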

Environment

  • Lenovo M91p 4524CTO Workstation
  • kernel-2.6.32-358.el6 & later
