Intermittent slow response on SuperMicro SuperServer X8DTT running RHEL 6.1 64bit
We have about a dozen SuperMicro SuperServer X8DTT machines running RHEL 6.1 64-bit. The exact kernel version from uname is 2.6.32-131.0.15.el6.x86_64. We have noticed many occasions on which a server becomes unresponsive for a period of time. Open ssh/terminal windows that were working fine suddenly slow down and show response times (i.e., the time to echo typed characters) measured in tens of seconds. The machines are not heavily loaded, are not swapping, and are not filling a log with disk error messages. I have never found any indication of the problem in /var/log/messages. In my experience a reboot clears the symptom; alternatively, I wait 5-10 minutes and it goes away. Tellingly, the problem has occurred on a server that previously ran Red Hat 5.5 and had no such problem under that version. So I am leaning towards the hypothesis of a Red Hat 6.1 kernel problem on this particular motherboard.
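Since nothing shows up in /var/log/messages, one way to get timestamps for stalls like these is a small watcher that sleeps in short steps and reports whenever it wakes up much later than requested. The following is only a minimal sketch, not a tested diagnostic: the names find_stalls and watch, the 0.5 s interval, and the 1 s threshold are all made up for illustration, and it assumes a Python 3 interpreter is available on the machine (stock RHEL 6 ships Python 2.6, which lacks time.monotonic).

```python
import time


def find_stalls(timestamps, interval, threshold):
    """Return (start, lateness) pairs for gaps longer than interval + threshold.

    timestamps: monotonic wake-up times in seconds, in increasing order.
    """
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > interval + threshold:
            stalls.append((prev, gap - interval))
    return stalls


def watch(interval=0.5, threshold=1.0, duration=60.0):
    """Sleep in short steps and print a line whenever a wake-up arrives late.

    interval, threshold and duration are illustrative defaults, not tuned values.
    """
    start = time.monotonic()
    prev = start
    while prev - start < duration:
        time.sleep(interval)
        now = time.monotonic()
        late = now - prev - interval
        if late > threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} stall: woke up {late:.1f}s late")
        prev = now
```

Calling watch(duration=3600) from a script started at boot would log the wall-clock time of each stall, which could then be lined up against other logs. A watcher like this only shows that the process was not scheduled on time; it does not say why.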
Responses
Christopher,
Can you show the output of top again, but this time after pressing the 1 key, so that each CPU is listed separately?
It would look like this:
================================================================================
top - 06:00:11 up 7 min,  2 users,  load average: 1.08, 1.18, 0.68
Tasks: 299 total,   1 running, 298 sleeping,   0 stopped,   0 zombie
Cpu0  :  1.3%us,  0.3%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.3%us,  0.7%sy,  0.0%ni, 97.7%id,  1.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3649432k total,  1222336k used,  2427096k free,    80372k buffers
Swap:  2097144k total,        0k used,  2097144k free,   492788k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2966 root      20   0  156m  33m  13m S  1.3  0.9   0:10.51 Xorg
 3632 root      20   0  851m 136m  31m S  0.7  3.8   0:24.45 firefox
 1850 root      20   0     0    0    0 S  0.3  0.0   0:00.15 kondemand/1
 2291 root      20   0 20204 1252 1068 S  0.3  0.0   0:00.07 hald-addon-stor
 3292 root      20   0  331m  17m  11m S  0.3  0.5   0:00.56 gnome-panel
 3784 root      20   0  299m  14m  10m S  0.3  0.4   0:00.33 gnome-terminal
    1 root      20   0 21396 1544 1240 S  0.0  0.0   0:01.13 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.24 migration/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/0
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.40 migration/0
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
    7 root      RT   0     0    0    0 S  0.0  0.0   0:00.88 migration/1
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.40 migration/1
    9 root      20   0     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/1
   10 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/1
   11 root      RT   0     0    0    0 S  0.0  0.0   0:00.17 migration/2
   12 root      RT   0     0    0    0 S  0.0  0.0   0:00.40 migration/2
   13 root      20   0     0    0    0 S  0.0  0.0   0:00.02 ksoftirqd/2
   14 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/2
   15 root      RT   0     0    0    0 S  0.0  0.0   0:00.20 migration/3
   16 root      RT   0     0    0    0 S  0.0  0.0   0:00.40 migration/3
   17 root      20   0     0    0    0 S  0.0  0.0   0:00.02 ksoftirqd/3
   18 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/3
   19 root      20   0     0    0    0 S  0.0  0.0   0:00.07 events/0
   20 root      20   0     0    0    0 S  0.0  0.0   0:00.05 events/1
   21 root      20   0     0    0    0 S  0.0  0.0   0:00.17 events/2
   22 root      20   0     0    0    0 S  0.0  0.0   0:00.03 events/3
===========================================================
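If catching the slowdown interactively in top proves difficult, the same per-CPU split (us, ni, sy, id, wa, hi, si, st) can be derived from two reads of /proc/stat, which is where these counters come from on Linux. The following is a rough sketch under that assumption; the function names parse_proc_stat, per_cpu_percent, and sample are invented for illustration, and only the first eight counters of each cpuN line are used.

```python
import time

# Order of the first eight counters on each cpuN line of /proc/stat,
# labelled with the abbreviations that top uses.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_proc_stat(text):
    """Map 'cpu0', 'cpu1', ... to their list of integer tick counters."""
    cpus = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the bare "cpu" line, which is the total over all CPUs.
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            cpus[parts[0]] = [int(value) for value in parts[1:]]
    return cpus


def per_cpu_percent(before, after):
    """Turn two parsed snapshots into per-CPU percentages per field."""
    result = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue
        deltas = [n - o for n, o in zip(new[:8], old[:8])]
        total = sum(deltas)
        if total <= 0:
            continue  # no ticks elapsed between the two snapshots
        result[name] = {f: 100.0 * d / total for f, d in zip(FIELDS, deltas)}
    return result


def sample(interval=1.0):
    """Read /proc/stat twice, interval seconds apart (Linux only)."""
    with open("/proc/stat") as handle:
        before = parse_proc_stat(handle.read())
    time.sleep(interval)
    with open("/proc/stat") as handle:
        after = parse_proc_stat(handle.read())
    return per_cpu_percent(before, after)
```

Running sample() in a loop and appending the result to a file would leave a per-CPU record covering the slow periods, so a single CPU stuck in wa or si would stand out even if the all-CPU average looks idle.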
Kind regards,
ir. Jan Gerrit Kootstra
Hi,
Well, it shows that the average over all CPUs is not hiding ("filtering out") any information.
Does the hardware have any event monitoring?
If you suspect the kernel, did you open a support case at Red Hat?
Kind regards,
ir Jan Gerrit
P.S. kSar is a Java application that can create a report with performance graphs as a PDF. With it you may be able to see whether there are drops in CPU activity or a disk under heavy demand (iowait is low, so maybe not).