System hang due to processes being starved by the CFS scheduler

Solution Unverified - Updated 2024-08-05T06:38:10+00:00

Issue

Unexpected reboots of two production modules.
Two systems at a critical production site running RHEL6U5 rebooted unexpectedly within one hour of each other.
Both production systems had been running in a stable state for around 3 months.
Both systems appear to have been running with a normal load pattern until 15 minutes before the incident.
Within a few minutes, the load from all application-related Java processes rose sharply, and the load average went from 1+ to 200+.
The BMC watchdog timer did not receive its periodic heartbeat from the OS for 120 seconds, so the BMC sent an NMI to the host OS.
Evaluation of possible abnormal network traffic and/or application issues as the cause of the sudden load increase is still in progress.
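
The watchdog behaviour described above (no heartbeat from the OS for 120 seconds, followed by an NMI) can be illustrated with a toy model. This is only a sketch of the general countdown logic, not the BMC firmware or any real IPMI interface: the class `WatchdogTimer`, the helper `first_expiry`, and the 30-second heartbeat interval are all hypothetical and chosen for illustration. Only the 120-second timeout comes from the incident description.

```python
class WatchdogTimer:
    """Toy model of a watchdog countdown (hypothetical, not a real BMC/IPMI API)."""

    def __init__(self, timeout_s, now=0.0):
        self.timeout_s = timeout_s
        self.last_heartbeat = now

    def heartbeat(self, now):
        # The OS-side daemon "pets" the watchdog, restarting the countdown.
        self.last_heartbeat = now

    def expired(self, now):
        # True once timeout_s seconds have passed without a heartbeat.
        return now - self.last_heartbeat >= self.timeout_s


def first_expiry(heartbeat_times, timeout_s, poll_until, step=1.0):
    """Return the first polled time (seconds) at which the watchdog expires, or None."""
    wd = WatchdogTimer(timeout_s)
    pending = sorted(heartbeat_times)
    i = 0
    t = 0.0
    while t <= poll_until:
        # Deliver every heartbeat that has arrived up to time t.
        while i < len(pending) and pending[i] <= t:
            wd.heartbeat(pending[i])
            i += 1
        if wd.expired(t):
            return t  # in the incident, this is the point where an NMI was sent
        t += step
    return None


if __name__ == "__main__":
    # Assumed scenario: heartbeats every 30 s, then the OS stops responding after t=90.
    print(first_expiry([0, 30, 60, 90], timeout_s=120, poll_until=300))
    # Healthy system: heartbeats keep arriving, so the timer never expires.
    print(first_expiry(range(0, 301, 30), timeout_s=120, poll_until=300))
```

In this model, a system whose heartbeat process is starved of CPU time looks the same to the watchdog as one that has crashed: the timer only sees that no heartbeat arrived within the timeout.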
