specific guidance on tuning a server as a compute engine


Hi,

I looked through the knowledge base and have read Shak & Larry's slides on tuning. There are some things I am trying to do, and I was hoping for more specific feedback on what will and won't be possible. Our tools lock us into RHEL 6 for at least the next year.

Our situation is that we have analog chip designers for whom analog simulation is a critical part of the job, and they spend quite a bit of their day waiting for sims to complete. Turning one job around as fast as possible is more important than getting 50 things done in parallel. I am building a system with dual Xeon E5-2643 v2 processors (6 cores and 25MB L3 cache each, NUMA), lots of RAM, and local SSDs for all the simulation output files. There will be a NAS with 64GB of RAM caching all the design files, connected by a 10G Ethernet link, so I'm not worried about getting the sims started quickly. The critical single sims run 1-20 minutes. There are other sims that run for days, but my focus is entirely on the sims in the critical path of the daily design work. There will be other work on the machine, but as long as those jobs don't perform terribly, they can be less than optimal.

I want to allocate 8 of the 12 cores to the simulation jobs, and was looking at using cgred with the executable name to make sure things get put in the right group. I am assuming I would assign 4 of the 6 cores from each socket to the sim cgroup. How do cgred and numad play together? I would certainly want jobs scheduled on the same processor as their memory, but it is not clear to me what happens when the physical memory for 5 jobs is allocated on one processor. It would be ideal if the physical memory allocations were distributed, but I don't think that can be controlled.
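For concreteness, the config I had in mind looks roughly like this. The core and node numbers are guesses for this box (I'll check /proc/cpuinfo and numactl --hardware for the real enumeration), and `spectre` just stands in for whatever the simulator binary ends up being called:

```
# /etc/cgconfig.conf -- sketch; assumes cores 0-5 are on socket 0 and
# cores 6-11 on socket 1, which may not match the actual enumeration
group sim {
    cpuset {
        cpuset.cpus = "0-3,6-9";   # 4 of 6 cores from each socket
        cpuset.mems = "0-1";       # memory from both NUMA nodes
    }
}

# /etc/cgrules.conf -- cgred rule routing by executable name
# <user>:<process>      <controllers>   <destination group>
*:spectre               cpuset          sim
```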

The next thing is that if there were only two sims running, I would want them on different processors. I have extra-large heat sinks to keep the processors cool, and want as much advantage as possible from Turbo clocking when a chip isn't at max heat dissipation. Is there any way to get this to happen automatically? I have thought about writing my own tiny allocator using one cgroup per processor/memory controller and then launching each job with a group assignment. That's a lot of work, and I would love to avoid it.
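If I do end up writing that tiny allocator, I imagine it could be as small as something like this (a sketch, not tested on real hardware: the cgroup names `sim0`/`sim1`, the mount point, and the tie-breaking are all my assumptions):

```shell
#!/bin/sh
# Sketch of a per-socket launcher: one cpuset cgroup per socket,
# each new sim goes to whichever socket currently hosts fewer sims.
# Assumes cgroups sim0 and sim1 were created ahead of time.

CG=/cgroup/cpuset   # default mount point once cgconfig is running

count_tasks() {
    # Tasks currently in the given cpuset; 0 if it doesn't exist.
    if [ -f "$CG/$1/tasks" ]; then
        wc -l < "$CG/$1/tasks"
    else
        echo 0
    fi
}

pick_group() {
    # Print the name of the less-loaded group (ties go to socket 0).
    if [ "$(count_tasks sim0)" -le "$(count_tasks sim1)" ]; then
        echo sim0
    else
        echo sim1
    fi
}

# Usage: run-sim.sh <simulator command...>
if [ $# -gt 0 ]; then
    exec cgexec -g "cpuset:$(pick_group)" "$@"
fi
```

Each wrapper invocation would land the sim in whichever per-socket cpuset holds fewer tasks, so two lone sims end up on different sockets.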

The next question is how far I can push the schedulers without making the interactive tools suck. The suggestion was to move the scheduler quantum from 5us to 15us; I was going to try 25us or even 40us. There will be diminishing returns, but every percent counts. Any thoughts on where it becomes pointless for the sims and starts to severely impact the other apps? Remember that I still have four fast cores and a load of memory to keep everything else humming. I am also thinking about using the noop I/O scheduler, but I don't have any feel for what that would do.
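In case it helps, here is the kind of knob-twiddling I mean, assuming the quantum from the slides maps to `sched_min_granularity_ns` (worth confirming) and with the SSD device name as a placeholder:

```
# Scheduler quantum, in nanoseconds (25us shown; 40us = 40000)
sysctl -w kernel.sched_min_granularity_ns=25000

# Switch just the local SSD to the noop elevator (sdb is a guess)
echo noop > /sys/block/sdb/queue/scheduler
cat /sys/block/sdb/queue/scheduler   # active one shows in brackets
```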

Another question is about huge pages. I will allocate a bunch of huge pages (either 16 or 32G worth of 2M pages) to be used in transparent mode. Is there any way to allocate these to cgroups, or any other action that will keep the majority of those pages for the sims and let the other apps take the TLB misses?
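My understanding so far, which I would love corrected if wrong: transparent huge pages don't come from a reserved pool at all (THP promotes ordinary anonymous memory on its own), an explicit pool is reserved with `vm.nr_hugepages` and used via hugetlbfs, and the RHEL 6 kernel has no cgroup controller for partitioning either kind. So the pre-allocation I would sketch is:

```
# Reserve an explicit pool of 2M pages: 16G worth = 8192 pages
sysctl -w vm.nr_hugepages=8192

# Check THP status (RHEL 6 uses a redhat_-prefixed path)
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
```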

I didn't fully understand memory interleaving and how it works with the Xeon memory controllers. Is this something I should be using? If I can keep all 4 memory controllers hot, it's a win.
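As a concrete example of what I would compare, with `./sim` standing in for the real binary: interleaving spreads pages round-robin across both nodes, versus binding CPU and memory to one socket:

```
# Pages placed round-robin across both NUMA nodes
numactl --interleave=all ./sim

# Versus keeping the job and its memory together on socket 0
numactl --cpunodebind=0 --membind=0 ./sim
```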

The system should be assembled in a week, and then I will start tuning and testing. If people are interested, I will write up the results of my experience. Running cachegrind and numastat should start to give me a picture of how these things work.

thanks in advance,
jerry

