Frequent Freeze with MCE errors observed on Lenovo M91p 4524CTO running Red Hat Enterprise Linux 6.4 and later releases
Issue
- Frequent hard freeze on multiple Lenovo M91p 4524CTO workstations.
- Freeze only appears to occur when a user is logged in but away from their desk (system idle, screensaver on or monitor in powersave).
- The issue is observed in runlevel 3 as well as runlevel 5. It is not reproduced when the "nomodeset" parameter is passed on the kernel command line.
- The freeze is complete: the screen and keyboard go dead (the numlock light does not respond and sysrq sequences have no effect), and no network traffic is accepted or returned (ping, ssh, etc.).
- No information about the freeze is recorded in the local logs or in those sent to the loghost. The only recovery method is to hold the power button down for at least 4 seconds and then press it again (or remove and restore power).
- A vmcore cannot be collected because the hang is complete.
- The following MCE logs were collected from the serial console:
kernel: Disabling lock debugging due to kernel taint
kernel: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 4: b200000011000402
kernel: [Hardware Error]: RIP !INEXACT! 10:<ffffffff812d3a71> {intel_idle+0xb1/0x170}
kernel: [Hardware Error]: TSC 1a61fb503eb7c
kernel: [Hardware Error]: PROCESSOR 0:206a7 TIME 1378866642 SOCKET 0 APIC 1
kernel: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 4: b200000011000402
kernel: [Hardware Error]: RIP !INEXACT! 33:<0000003a9dd9bbb0>
kernel: [Hardware Error]: TSC 1a61fb503eb62
kernel: [Hardware Error]: PROCESSOR 0:206a7 TIME 1378866642 SOCKET 0 APIC 0
kernel: [Hardware Error]: Some CPUs didn't answer in synchronization
kernel: [Hardware Error]: Machine check: Processor context corrupt
kernel: Kernel panic - not syncing: Fatal machine check on current CPU
kernel: Disabling lock debugging due to kernel taint
kernel: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 4: b200000011000402
kernel: [Hardware Error]: RIP !INEXACT! 10:<ffffffff812d3a71> {intel_idle+0xb1/0x170}
kernel: [Hardware Error]: TSC 25a05e2014bf8
kernel: [Hardware Error]: PROCESSOR 0:206a7 TIME 1379010691 SOCKET 0 APIC 1
kernel: [Hardware Error]: Some CPUs didn't answer in synchronization
kernel: [Hardware Error]: Machine check: Processor context corrupt
kernel: Kernel panic - not syncing: Fatal machine check on current CPU
# mcelog --ascii
CPU 4: Machine Check Exception: 5 Bank 4: b200000011000402
RIP !INEXACT! 10:<ffffffff812d3a71> {intel_idle+0xb1/0x170}
TSC 1a61fb503eb7c
PROCESSOR 0:206a7 TIME 1378866642 SOCKET 0 APIC 1
Hardware event. This is not a software error.
CPU 4 BANK 4 TSC 1a61fb503eb7c
RIP !INEXACT! 10:ffffffff812d3a71
TIME 1378866642 Tue Sep 10 22:30:42 2013
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal unclassified error: 402
PCU: No error <24:11>
STATUS b200000011000402 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 42
RIP: intel_idle+0xb1/0x170}
SOCKET 0 APIC 1
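
The raw values in the output above can be cross-checked by hand. The following is a minimal Python sketch, not part of mcelog, that decodes the STATUS value, the MCGSTATUS value, and the CPUID signature from the PROCESSOR field. It assumes the architectural bit layout of IA32_MCi_STATUS and IA32_MCG_STATUS described in the Intel Software Developer's Manual. The function and table names are hypothetical and chosen only for illustration.

```python
# Architectural flag bits (assumed from the Intel SDM layout of
# IA32_MCi_STATUS and IA32_MCG_STATUS); names here are illustrative only.
MCI_STATUS_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                    59: "MISCV", 58: "ADDRV", 57: "PCC"}
MCG_STATUS_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def set_flags(value, table):
    """Return the names of the flags set in value, lowest bit first."""
    return [name for bit, name in sorted(table.items()) if (value >> bit) & 1]


def decode_mci_status(status):
    """Split a bank status value into flags and error code fields."""
    return {
        "flags": set_flags(status, MCI_STATUS_FLAGS),
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


def decode_cpuid_signature(signature):
    """Derive display family, model and stepping from a CPUID signature."""
    stepping = signature & 0xF
    base_model = (signature >> 4) & 0xF
    base_family = (signature >> 8) & 0xF
    ext_model = (signature >> 16) & 0xF
    ext_family = (signature >> 20) & 0xFF
    # The extended fields only apply to base family 6 or 15.
    family = base_family + ext_family if base_family == 0xF else base_family
    model = (ext_model << 4) | base_model if base_family in (0x6, 0xF) else base_model
    return {"family": family, "model": model, "stepping": stepping}


if __name__ == "__main__":
    # Values copied from the mcelog output in this article.
    print(decode_mci_status(0xB200000011000402))
    print(set_flags(0x5, MCG_STATUS_FLAGS))
    print(decode_cpuid_signature(0x206A7))
```

Under these assumptions the decoded flags are UC, EN and PCC, which agrees with the "Uncorrected error", "Error enabled" and "Processor context corrupt" lines, the low 16 bits are 0x402, MCGSTATUS 5 decodes to RIPV and MCIP, and the signature 206a7 decodes to family 6, model 42.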
Environment
- Lenovo M91p 4524CTO Workstation
- kernel-2.6.32-358.el6 and later