First, I know that RHEL 5.5 is kinda ancient. We're running an application that the vendor requires 32-bit RHEL 5 for. The app's recommended memory minimums are 16GB (Oracle with a Java app-stack). The deployed system has 48GB of RAM installed.
Unfortunately, every few hours, the system *thinks* that it's running out of memory. This triggers oom-killer, which, in turn, kills Oracle, kills the app-stack, kills monitoring tools, kills our AD authentication connector ...pretty much kills everything.
I've set up sadc to collect every 2 minutes (it was previously set to 10-minute probes). At the time that oom-killer gets triggered, sadc is reporting that memory utilization is only at about 10-20% and that swap is pretty much unused. During these events, the load average spikes to about 4 (which, on a dual quad-core system, is nothing). The only real indication of something odd going on is that, in the sadc probe taken at the time oom-killer starts going nuts, the system's context-switching rate goes up by more than 10x.
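Since a 2-minute sadc interval can still smear out a short burst, a finer-grained watcher might help pin down exactly when the context-switch spike starts. The following is only a rough sketch, assuming a Python 3 interpreter is available on (or the logic is adapted for) the box. It reads the cumulative `ctxt` counter from `/proc/stat`; the function names, the 5-second interval, and the 10x threshold are illustrative choices of mine, not anything sadc provides.

```python
import time


def read_ctxt(stat_text):
    """Return the cumulative context-switch count from /proc/stat contents."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no ctxt line found")


def is_spike(rate, baseline, factor=10.0):
    """True when rate exceeds factor times a positive baseline."""
    return baseline > 0 and rate > factor * baseline


def sample_rate(interval=5.0, path="/proc/stat"):
    """Context switches per second, measured over one interval."""
    with open(path) as f:
        before = read_ctxt(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = read_ctxt(f.read())
    return (after - before) / interval


def watch(interval=5.0, factor=10.0):
    """Print a timestamped line whenever the rate jumps past the baseline.

    Hypothetical helper: the baseline is a simple moving average that is
    only updated by non-spike samples.
    """
    baseline = sample_rate(interval)
    while True:
        rate = sample_rate(interval)
        if is_spike(rate, baseline, factor):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(stamp, "cswch/s spike:", int(rate), "baseline:", int(baseline))
        else:
            baseline = 0.9 * baseline + 0.1 * rate
```

Calling `watch()` from a screen session and redirecting the output to a file would give timestamps to line up against the oom-killer entries in the system log.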
This cswitch-spike is the only point of consistency. There's no time or date consistency to the events. There's nothing in the system or audit logs indicating that some process has gone rogue and triggered oom-killer through memory starvation. Hell, even when oom-killer kicks off and cites "low memory", it's claiming low memory while indicating that MAX MEM/SWAP and FREE MEM/SWAP are just about equal.
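To cross-check what sadc and oom-killer are each reporting, the raw numbers in `/proc/meminfo` can be summarized directly. This is a minimal sketch, again assuming Python 3; the helper names are hypothetical. It also reports `LowTotal`/`LowFree`, which `/proc/meminfo` exposes on 32-bit kernels built with highmem support. Whether those low-zone figures are actually relevant to this problem is an assumption on my part, not something the logs have confirmed.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo contents into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def percent_used(values, total_key, free_key):
    """Percent of total_key not counted in free_key, or None if unavailable."""
    total = values.get(total_key)
    free = values.get(free_key)
    if not total or free is None:
        return None
    return 100.0 * (total - free) / total


def summarize(text):
    """Utilization for overall memory, swap, and (if present) the low zone.

    Note: MemFree excludes buffers and page cache, so the "mem" figure
    overstates what applications are really using.
    """
    values = parse_meminfo(text)
    return {
        "mem": percent_used(values, "MemTotal", "MemFree"),
        "swap": percent_used(values, "SwapTotal", "SwapFree"),
        "low": percent_used(values, "LowTotal", "LowFree"),
    }
```

For example, `summarize(open("/proc/meminfo").read())` could be logged alongside each sadc probe; a `"low"` value of `None` just means the kernel does not report those fields.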
Anyone seen this behavior before?