System hang due to processes being starved by the CFS scheduler
Issue
- Two production modules rebooted unexpectedly.
- Two systems at a critical production site, both running RHEL 6.5, rebooted unexpectedly within the space of 1 hour.
- Both production systems had been running in a stable state for around 3 months.
- Both systems appear to have been running with a normal load pattern until 15 minutes before the incident.
- Within a few minutes, the load from all application-related Java processes increased suddenly, and the load average went from 1+ to 200+.
- The BMC watchdog timer did not receive its periodic heartbeat from the OS for 120 seconds, so the BMC sent an NMI to the host OS.
- Evaluation of possible abnormal network traffic and/or application issues as the cause of the sudden load increase is still in progress.
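
The watchdog behaviour described above can be illustrated with a small model. This is only a sketch under simplifying assumptions: it treats the BMC watchdog as a plain countdown that each OS heartbeat resets, and it assumes the timer expires once 120 seconds pass without a reset. The function name `watchdog_expiry` and its parameters are hypothetical and are not part of any BMC or kernel interface; actual behaviour depends on the BMC firmware and its configuration.

```python
from typing import Iterable, Optional

# Timeout taken from the incident description; everything else is illustrative.
WATCHDOG_TIMEOUT_S = 120.0


def watchdog_expiry(heartbeats: Iterable[float],
                    observed_until: float,
                    timeout: float = WATCHDOG_TIMEOUT_S) -> Optional[float]:
    """Return the time at which the modelled timer expires, or None.

    heartbeats:     timestamps (seconds) at which the OS reset the timer.
    observed_until: end of the observation window (seconds).
    A heartbeat arriving exactly at the deadline is treated as a valid reset.
    """
    deadline = None  # the timer is not armed until the first heartbeat
    for t in sorted(heartbeats):
        if deadline is not None and t > deadline:
            # The previous deadline passed before this heartbeat arrived.
            return deadline
        deadline = t + timeout
    if deadline is not None and observed_until > deadline:
        # No further heartbeat arrived before the last deadline.
        return deadline
    return None


if __name__ == "__main__":
    # Heartbeats every 60 s that stop after t=120 s, observed until t=300 s.
    print(watchdog_expiry([0, 60, 120], observed_until=300))
```

In this model, a starved heartbeat task that stops resetting the timer at t=120 s leads to expiry at t=240 s, which corresponds to the point at which the BMC would raise the NMI in the scenario described.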
Environment
- Red Hat Enterprise Linux 6.5
- CFS scheduler
- Software RAID storage configuration using mdraid1