RHEL 5.7 huge load average

Hello,
After the last kernel update I am facing a huge load average, more than 220, on a server with 16 virtual CPUs, 16 GB RAM, and 4 GB swap, while only half of the RAM is consumed and no swap at all.
The server's mail role is sendmail, and ps aux | egrep "D|Ds" shows more than 170 sendmail processes. Normally the server handles ~100,000 emails/hour, and before the update the same volume raised the load to at most ~80.
Now ps alx shows wchan as "-", so there is no clue as to what I/O the processes are waiting for (I don't know how it looked before, since I didn't have this problem then).
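For reference, this is a rough sketch of how the D-state processes and their wait channels can be listed (the wchan column width is arbitrary):

~~~
# Show PID, state, wait channel, and command for processes in uninterruptible sleep
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
~~~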
The current kernel version is kernel-2.6.18-371.8.1.el5.
I should add that there were no changes in configuration, mail queue size, or traffic.
Any advice is welcome.

Thank you

Responses

You could check what your IO scheduler is. It may perform better if you switch from cfq to deadline.

~~~
cat /sys/block//queue/scheduler
~~~

The one inside [ ] is currently selected.
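To check every block device at once, something like this works (a sketch):

~~~
# Print the scheduler line for each block device; the active one is in [ ]
for f in /sys/block/*/queue/scheduler; do echo "$f: $(cat $f)"; done
~~~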

To update use:

~~~
echo sched_name > /sys/block//queue/scheduler
~~~

Or append elevator=deadline to the kernel line in grub.conf
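For example, the kernel line in /boot/grub/grub.conf would end up looking something like this (the kernel version and root device here are illustrative):

~~~
kernel /vmlinuz-2.6.18-371.8.1.el5 ro root=/dev/VolGroup00/LogVol00 elevator=deadline
~~~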

See this article for details:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Kernel_Boot_Parameters-The_IO_Scheduler.html

Though this doesn't answer why the load has increased, it may well help.

The post has stripped the disk name from the above cat and echo commands. They should be:

~~~
cat /sys/block/sda/queue/scheduler
echo sched_name > /sys/block/sda/queue/scheduler
~~~

(or /sys/block/vda/... if your disk is named vda)

Andy, try putting commands and code blocks with a line of "~~~" above and below. It will look

~~~
like this
~~~

Hello,
Thank you for the info, I forgot about this feature.
As sendmail's I/O is mostly writes, I will keep cfq, but set /sys/block/sda/queue/iosched/slice_idle to 0 and /sys/block/sda/queue/iosched/slice_async_rq to 3, as shown below.
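Spelled out, the commands are as follows (sda is the device in question; this takes effect immediately but does not persist across a reboot):

~~~
echo 0 > /sys/block/sda/queue/iosched/slice_idle
echo 3 > /sys/block/sda/queue/iosched/slice_async_rq
~~~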
I will come back with details

Also - sendmail typically warrants specific mount options if your system is particularly demanding...

I believe relatime is an accepted practice for /var/spool (some question whether to use noatime) - I, personally, don't know enough about this topic any more ;-)

There is a book, "Sendmail Performance Tuning" by Nick Christenson, which talks a bit about the mount options, etc. (page 40).

The whole book might be available on Google Books, but you should be able to find the section I am referring to by Googling "sendmail performance /etc/fstab".
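As an illustration only, such an fstab entry might look like this (the device, filesystem type, and exact mount point are assumptions, not a recommendation from the book):

~~~
# Mail queue on its own filesystem, mounted without atime updates
/dev/sdb1   /var/spool/mqueue   ext3   defaults,noatime   1 2
~~~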

Hello

I used noatime to mount the mail queue filesystem when I had a big mail queue, but there were no significant improvements. I reduced the mail queue by adding more CPUs and physical memory.
Now, after a period when everything was OK, I suddenly got a big load, while the mail queue stayed under 1000 emails and the server processed more than 80,000 emails/hour. Because this issue came "overnight", I was curious whether it may be related to RHSA-2014:0740, the futex syscall bug.
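A rough way to check whether that erratum's futex changes made it into the running kernel is the package changelog (this only shows what the changelog mentions; it is not proof either way):

~~~
rpm -q --changelog kernel-2.6.18-371.8.1.el5 | grep -i futex
~~~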
Thank you

These types of problems are certainly fun (seemingly random, without a good identifier as to what the issue might be ;-)

I ran into a similar issue a while back, also on RHEL 5, with the named service on the box. I believe POSIX permissions and SELinux contexts had been altered by something, and named was consuming an entire processor.

I would recommend opening a case so Red Hat can dig up the possible related causes for you.

Unfortunately, I have no support contract to open such a case.

Maybe somebody has a clue related to this trace:
~~~
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 28.25    0.006120          16       376        51 open
 23.04    0.004991          59        85           fsync
 22.16    0.004801         282        17           clone
 11.36    0.002460          66        37        20 wait4
  4.63    0.001004           3       391       170 stat
  4.61    0.000999          14        69        69 unlink
  4.31    0.000934           0      2751        86 read
  0.51    0.000110           0      2615           select
  0.38    0.000082           0       748           write
  0.22    0.000047           0      1933           fcntl
  0.11    0.000024           0       336           close
  0.10    0.000022           0       250           lstat
  0.09    0.000019           0       966           fstat
  0.08    0.000018           0       675           pread
  0.07    0.000015           0       233           geteuid
  0.04    0.000009           0       628           lseek
  0.04    0.000008           0       102           umask
  0.00    0.000000           0        18           poll
  0.00    0.000000           0         3           rt_sigaction
  0.00    0.000000           0         6           rt_sigprocmask
  0.00    0.000000           0        18           rt_sigreturn
  0.00    0.000000           0         2           dup2
  0.00    0.000000           0         3           alarm
  0.00    0.000000           0         9           socket
  0.00    0.000000           0         9           connect
  0.00    0.000000           0        26           sendto
  0.00    0.000000           0         9           recvfrom
  0.00    0.000000           0         1           kill
  0.00    0.000000           0         4           ftruncate
  0.00    0.000000           0        17           rename
  0.00    0.000000           0         1           setuid
  0.00    0.000000           0         1           statfs
~~~
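For completeness, a per-syscall summary like the one above can be collected with strace's counting mode, attached to whichever sendmail process looks stuck (the PID below is a placeholder):

~~~
# Attach to one sendmail process, let it run a while, then Ctrl-C to print the summary
strace -c -p 12345
~~~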

Thank you

Hello,
I think this topic may be closed.
It seems this is typical behavior for sendmail in this environment, and it is not affecting server performance.

Thanks for resolving this, Bohumil.