`oom-killer` problems on RHEL 5.5 32-bit (w/ PAE kernel)
First, I know that RHEL 5.5 is kinda ancient. We're running an application for which the vendor requires 32-bit RHEL 5. The app's recommended memory minimum is 16GB (Oracle with a Java app-stack). The deployed system has 48GB of RAM installed.
Unfortunately, every few hours, the system *thinks* that it's running out of memory. This triggers oom-killer, which, in turn, kills Oracle, kills the app-stack, kills monitoring tools, kills our AD authentication connector ...pretty much kills everything.
I've set up sadc to collect every 2 minutes (previously it was set to 10-minute probes). At the time oom-killer gets triggered, sadc is reporting that memory utilization is only at about 10-20% and that swap is pretty much unused. CPU, during these events, spikes at about a 4 load-average (which, on a dual quad-core system, is nothing). The only real indication of something odd going on is that, in the sadc probe taken when oom-killer starts going nuts, the system's context-switching rate goes up by more than 10x.
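(For reference, the tighter sampling was just a tweak to the stock sysstat cron job; on this box that's /etc/cron.d/sysstat, though the path may differ on other installs:)
# /etc/cron.d/sysstat -- sample every 2 minutes instead of the default 10
*/2 * * * * root /usr/lib/sa/sa1 1 1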
This cswitch-spike is the only point of consistency. There's no time or date consistency to the events. There's nothing in the system or audit logs indicating that some process has gone rogue and triggered oom-killer through memory starvation. Hell, even when oom-killer kicks off and cites "low memory", it's claiming low memory while indicating that MAX MEM/SWAP and FREE MEM/SWAP are just about equal.
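For anyone who wants the raw detail, the full oom-killer dump is in /var/log/messages; something like this should pull it out along with the per-zone stats it prints:
# grab the kernel's oom-killer dump, including its per-zone memory stats
grep -B 5 -A 30 "invoked oom-killer" /var/log/messages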
Anyone seen this behavior before?
Responses
Red Hat dropped the "hugemem" (4/4G split model) kernel after RHEL4. The 4/4GiB model drastically relieved the pressure on the constrained LOWMEM (in the 3/1G model) required to support paging above 4GiB with 36-bit PAE.
Unfortunately, as I understand it, upstream never supported it. It also caused a major performance hit, since the 4/4G model for kernel and userspace pages required many extra TLB flushes. I know that with "hugemem" the 32-bit JRE used to take up to a 30% performance hit (at least that was my experience in the FSI realm). At the same time, we used to see all sorts of issues with the 36-bit PAE kernel beyond 8GiB until a number of patches were made late in RHEL4. After that, 16GiB was usable with the "smp" kernel.
Even with those patches to the enduring 3/1G model, made late in RHEL4 and also in RHEL5, it's still not viable once you cross 16GiB. It starts eating so much LOWMEM that many processes will just start to exhibit exactly what you describe. I'm surprised you've had any success with a system beyond 32GiB. If you need more than 16GiB, you need to be running x86-64 as of RHEL5. Period.
IIRC, mapping the pages above 4GiB for 36-bit PAE in the Linux/x86 kernel requires 16MiB per 1GiB. That's 512MiB of kernel memory -- half of the kernel's LOWMEM in the 3/1G model -- for 32GiB. I could be rusty on this information (it's been almost 5 years). Remember that paging structures, the kernel itself, select memory-mapped I/O, etc. must all fit in this constrained area under the 36-bit PAE model.
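Run the numbers for your 48GiB box under that (admittedly rusty) 16MiB-per-GiB figure and you can see the squeeze:
# back-of-the-envelope, assuming ~16MiB of LOWMEM per GiB of mapped HighMem
echo $(( 32 * 16 ))   # 512MiB for 32GiB
echo $(( 48 * 16 ))   # 768MiB for 48GiB -- most of the ~896MiB Normal zone gone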
x86-64 had proliferated widely by 2007, so I don't blame Red Hat for dropping the 4/4G option as of RHEL5. The x86-64 platform introduces a new "flat" 48-bit addressing and 52-bit PAE model (making it compatible with segmented 48-bit pointers, 32-bit flat addressing, and 36-bit PAE), without all of the legacy paging into LOWMEM. Again, if you need more than 16GiB, you need to be using x86-64 for RHEL5.
If you must stay on RHEL5 x86, then consider physically removing memory, or artificially limiting it with the kernel boot parameters highmem=/mem=. Try 32GiB, then drop to 16GiB if required. I really don't recommend more than 16GiB, and Red Hat's own link above spells that out too.
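If you go the mem= route, it's just an append to the kernel line in /boot/grub/grub.conf, something along these lines (kernel version and root device here are only illustrative, match them to your existing entry):
title Red Hat Enterprise Linux Server (2.6.18-194.el5PAE, capped at 16GiB)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-194.el5PAE ro root=/dev/VolGroup00/LogVol00 mem=16G
        initrd /initrd-2.6.18-194.el5PAE.img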
Another option is to talk to your ISV. They should at least support EL x86-64, even if you have to load a 32-bit JRE/JDK and other 32-bit libraries (since those programs cannot use libraries written for the native 52-bit PAE model in x86-64). Red Hat ships a number of 32-bit libraries in EL x86-64 that are compatible with 32-bit and 36-bit PAE software. For those it does not ship, your ISV should provide them.
You're honestly in uncharted territory with RHEL5+ on 32-bit beyond 16GiB (let alone 32GiB). Come to think of it, I don't know of any x86 OS that supported more than 32GiB, other than Red Hat EL3/4 with the split 4/4G model. Even Microsoft stopped at 32GiB (their docs can be conflicting: some for 2003 R2 Datacenter state 32GiB, others 64GiB, and still others clarify that the latter is only for x64) because of the sheer paging overhead below 4GiB (using a 29-bit/512MiB paging model, IIRC).
Yeah, PAE is a PC thing. It's a long story. All i686-compatible processors support 36-bit PAE (64GiB; the P3 and later actually support 36-bit PSE, but that's another story). Unfortunately, you still have the constraints of LOWMEM and of 32-bit flat (4GiB) memory allocation.
Depending on the kernel memory model, there are either kernel-vs-user memory constraints, or paging performance hits as the kernel gets its own 4G separate from the 4G of user pages. This is because, again, x86 is still limited to 32-bit flat memory allocation. Red Hat removed the 4/4G model as of RHEL5, as I mentioned.
Almost everyone today is running x86-64, which is 52-bit PAE (4PiB) and 48-bit flat (256TiB), making it compatible with both 36-bit PAE (64GiB) and 48-bit segmented pointers normalized to 32-bit flat (4GiB). Of course, pointers aren't compatible across memory models, hence the 32-bit vs. 64-bit library incompatibilities.
Again, x86-64 kernels can run both types of programs and libraries. The problem is when you have a legacy x86 program but not the required x86 libraries. Red Hat does not bundle all x86 libraries in x86-64, for various reasons. So it falls to your ISV to provide them so the product works on both x86 and x86-64. There can be other issues as well.
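On EL x86-64 the multilib packages usually cover the common cases; roughly something like this (the binary path and package names are examples only, check what the app actually links against):
ldd /path/to/the/vendor/binary | grep "not found"   # see which 32-bit libs are missing
yum install glibc.i686 libstdc++.i386 zlib.i386     # install the matching 32-bit packages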
Don't know what to tell you other than to lean on your ISV and work with them.
You might be starved of low memory.
Look at /proc/meminfo or free -l during periods of heavy activity and see if you are running low on "low memory."
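Concretely, on a 32-bit PAE kernel /proc/meminfo carries LowTotal/LowFree lines, so something like this during the busy periods tells the story (LowFree collapsing while MemFree stays huge is the classic sign):
free -lm                          # the Low: row is the one to watch
grep -i '^Low' /proc/meminfo      # LowTotal / LowFree on a 32-bit kernel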
If so, you should adjust /proc/sys/vm/lower_zone_protection. 100-250 is reasonable.
To set this option on boot, add the following to /etc/sysctl.conf:
vm.lower_zone_protection = 100
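To pick it up without a reboot (assuming your kernel exposes the tunable), either write it directly or re-read sysctl.conf:
sysctl -w vm.lower_zone_protection=100    # apply immediately
sysctl -p                                 # or re-read /etc/sysctl.conf after editing it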
Here is an article on Memory Management in the Linux 2.6 kernel.
http://www.ibm.com/developerworks/library/l-mem26/
Linus wrote a nice post on lower_zone_protection on LKML a while back, but I can't seem to find it.
BTW: to answer your question -- we saw this commonly enough in our environment that we set vm.lower_zone_protection to 100 out of the box (and have raised it further in circumstances where there was additional memory pressure).
Phil -- I didn't even know about this tuning. Thanx for pointing it out. All I know is that LOWMEM can be constrained under 36-bit PAE (x86). I know that with the PAE model, the further you go beyond 4GiB, the more LOWMEM it consumes.
The one question I have is whether you were running with large memory like the OP (48GiB), or were at 16GiB (or lower)?
I have not looked at this since the RHEL4 days, when we benchmarked "hugemem" v. "smp" and looked at how much LOWMEM was being consumed on 8GiB, 16GiB and 32GiB systems. In a nutshell, it always seemed that I should use the "hugemem" kernel with anything over 8GiB as a rule of thumb, and absolutely beyond 16GiB (even after some of the PAE patches in 2.6.1x and later).
This is one of those things that you really need to talk to your ISV about and/or test yourself. If the application was bundled with EL, or commonly deployed, we may be able to provide more input.
But other than explaining generically why you're having issues, and offering several plausible solutions to mitigate your risks, it's really up to you which will work better for your ISV's product and the systems running it.
With the x86 architecture, physical memory from 16MB to 896MB is known as "low memory" (ZONE_NORMAL), which is permanently mapped into kernel space. Many kernel resources must live in the low memory zone; in fact, many kernel operations can only take place in this zone. This makes the low memory area the most performance-critical zone. For example, if you run many resource-intensive applications and/or use a large amount of physical memory, "low memory" can run low, since more kernel structures must be allocated in this area. Under heavy I/O workloads the kernel may become starved for LowMem even though there is an abundance of available HighMem. As the kernel tries to keep as much data in cache as possible, this can lead to oom-killer events or complete system hangs.
On 64-bit systems, all memory is allocated in ZONE_NORMAL, so lowmem starvation does not affect them. Moving to 64-bit is a permanent fix for lowmem starvation.
Source: https://access.redhat.com/kb/docs/DOC-52977
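A couple of quick ways to watch the Normal zone directly on a running box (a rough sketch; exact output varies a bit by kernel):
cat /proc/buddyinfo               # free pages per zone by allocation order -- watch the Normal row
grep -A 6 Normal /proc/zoneinfo   # per-zone counters, if your kernel provides zoneinfo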