Intermittent slow response on SuperMicro SuperServer X8DTT running RHEL 6.1 64bit

Latest response

We have about a dozen SuperMicro SuperServer model X8DTT running RHEL 6.1 64bit. The exact kernel version from uname is  2.6.32-131.0.15.el6.x86_64. We've noticed many occurrences when a server becomes unresponsive for a period of time. Open ssh/terminal windows that were working just fine suddenly slow down and exhibit response times (i.e., time to echo typed characters) measured in tens of seconds. The machines are not heavily loaded, are not swapping, are not filling a log with disk error messages, etc.  I've never found any indication at all in /var/log/messages.  A reboot fixes this symptom in my experience; alternately I just wait 5-10 minutes and it goes away. Telllingly, the problem has occurred on a server that was previously running RedHat 5.5, and while running that version it had no such problem.  So I am leaning towards a hypothesis of a RedHat 6.1 kernel problem on this particular motherboard.

Responses

top - 14:08:02 up 20 days,  3:04,  5 users,  load average: 8.55, 5.35, 2.39
Tasks: 434 total,   2 running, 432 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  24598960k total,  9500564k used, 15098396k free,   233628k buffers
Swap:  4095992k total,        0k used,  4095992k free,  5524624k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 7224 tomcat    20   0 56.4g 2.3g 413m S  0.7 10.0   3:26.53 java               
 1879 root      20   0  9140  620  412 S  0.3  0.0   8:11.09 irqbalance         
21709 abc       20   0 15284 1492  932 R  0.3  0.0   0:00.19 top                
    1 root      20   0 19336 1500 1212 S  0.0  0.0   0:03.41 init               
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.01 kthreadd           
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0        
    4 root      20   0     0    0    0 S  0.0  0.0   0:00.24 ksoftirqd/0        
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0        
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/0         
    7 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1        
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1        
    9 root      20   0     0    0    0 S  0.0  0.0   0:00.06 ksoftirqd/1        
   10 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/1         
   11 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2        
   12 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2        
   13 root      20   0     0    0    0 S  0.0  0.0   0:00.05 ksoftirqd/2        
   14 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/2

The previous post shows the output from top when the strange slowdown happens.  The load average is high but top shows the system nearly 100% idle, which seems like a contradiction to me.  Please tell me what additional evidence I can collect and how, thanks for reading this.

Christoper,

 

 

How many vCPU does this example server have?

 

If it has multiple like multi core CPU or multiple sockets, so output of top after you pushed the 1 key.

 

 

Maybe only 1 thread bus or core is working but the average is low.

 

 

Kind regards,

 

 

ir Jan Gerrit Kootstra

Right, the machine has 2 sockets, both populated with 4-core processors, and Intel says "hyperthreading" so RHEL sees 8 cores each; total of 16.

Christopher,

 

 

Can you show the output of top again, but after you pushed the 1?

 

It would look like this:

================================================================================

top - 06:00:11 up 7 min,  2 users,  load average: 1.08, 1.18, 0.68
Tasks: 299 total,   1 running, 298 sleeping,   0 stopped,   0 zombie
Cpu0  :  1.3%us,  0.3%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.3%us,  0.7%sy,  0.0%ni, 97.7%id,  1.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3649432k total,  1222336k used,  2427096k free,    80372k buffers
Swap:  2097144k total,        0k used,  2097144k free,   492788k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                     
 2966 root      20   0  156m  33m  13m S  1.3  0.9   0:10.51 Xorg                                                                        
 3632 root      20   0  851m 136m  31m S  0.7  3.8   0:24.45 firefox                                                                     
 1850 root      20   0     0    0    0 S  0.3  0.0   0:00.15 kondemand/1                                                                 
 2291 root      20   0 20204 1252 1068 S  0.3  0.0   0:00.07 hald-addon-stor                                                             
 3292 root      20   0  331m  17m  11m S  0.3  0.5   0:00.56 gnome-panel                                                                 
 3784 root      20   0  299m  14m  10m S  0.3  0.4   0:00.33 gnome-terminal                                                              
    1 root      20   0 21396 1544 1240 S  0.0  0.0   0:01.13 init                                                                        
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd                                                                    
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.24 migration/0                                                                 
    4 root      20   0     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/0                                                                 
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.40 migration/0                                                                 
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/0                                                                  
    7 root      RT   0     0    0    0 S  0.0  0.0   0:00.88 migration/1                                                                 
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.40 migration/1                                                                 
    9 root      20   0     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/1                                                                 
   10 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/1                                                                  
   11 root      RT   0     0    0    0 S  0.0  0.0   0:00.17 migration/2                                                                 
   12 root      RT   0     0    0    0 S  0.0  0.0   0:00.40 migration/2                                                                 
   13 root      20   0     0    0    0 S  0.0  0.0   0:00.02 ksoftirqd/2                                                                 
   14 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/2                                                                  
   15 root      RT   0     0    0    0 S  0.0  0.0   0:00.20 migration/3                                                                 
   16 root      RT   0     0    0    0 S  0.0  0.0   0:00.40 migration/3                                                                 
   17 root      20   0     0    0    0 S  0.0  0.0   0:00.02 ksoftirqd/3                                                                 
   18 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/3                                                                  
   19 root      20   0     0    0    0 S  0.0  0.0   0:00.07 events/0                                                                    
   20 root      20   0     0    0    0 S  0.0  0.0   0:00.05 events/1                                                                    
   21 root      20   0     0    0    0 S  0.0  0.0   0:00.17 events/2                                                                    
   22 root      20   0     0    0    0 S  0.0  0.0   0:00.03 events/3  

===========================================================

 

 

Kind regards,

 

 

ir. Jan Gerrit Kootstra

Thanks for the suggestion, next time this symptom appears I will do exactly that.  Unfortunately I have not yet found a way to force the slowdown to occur.

success in analysing the next occurence.

 

running ksar on the sar data may indicate some issues.

top - 13:53:31 up 52 days, 56 min,  3 users,  load average: 2.65, 1.32, 0.52
Tasks: 362 total,   1 running, 361 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  24598952k total,  4436196k used, 20162756k free,   166732k buffers
Swap:  4095992k total,        0k used,  4095992k free,  1934532k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                             
 2532 mysql     20   0 11.2g 664m 5420 S  0.2  2.8  45:34.06 mysqld                                              
25628 abc       20   0 97636 1760  920 S  0.2  0.0   0:00.03 sshd                                                
25652 abc       20   0 15152 1424  932 R  0.2  0.0   0:00.07 top                                                 
32173 abc       20   0 4889m 951m  11m S  0.2  4.0  70:59.55 java                                                
    1 root      20   0 19328 1552 1252 S  0.0  0.0   0:03.10 init                                                
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd                                            
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0                                         
    4 root      20   0     0    0    0 S  0.0  0.0   0:01.90 ksoftirqd/0                                         
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0                                         
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/0                                          
    7 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1                                         
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1                                         
    9 root      20   0     0    0    0 S  0.0  0.0   0:00.10 ksoftirqd/1                                         
   10 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/1                                          
   11 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2                                         
   12 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2                                         
   13 root      20   0     0    0    0 S  0.0  0.0   0:10.11 ksoftirqd/2                                         
   14 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/2                                          
   15 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3                                         
   16 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3                       

I welcome more guidance on ksar and sar.  Frankly I have to google them because I have never used those.

 

Above please see a pasted output from top with info on individual CPUs per your suggestion.  The machine was in one of its periodic slowdowns at the time.  Does this give any insight?  I'm not quite certain what to look for here.

Hi,

 

 

Well it shows that the average over all CPUs is not "filtering" information.

 

Does the hardware have an event monitoring?

 

If you suspect the kernel, did you open a support case at Red Hat?

 

 

Kind regards,

 

 

ir Jan Gerrit

 

P.S. ksar is a java code that can create a report with performance graphs as a PDF. So you may see whether there are drops in CPU activity or a disk getting high demands (iowait is low, so maybe not).

Thank you very much for the pointer!  Would you please explain what these kernel parameters mean and how they help?

intel_idle.max_cstate=0 processor.max_cstate=1

I tried to follow the link in Red Hat bug report 669453 to bug 710265 but RedHat bugzilla says "Access Denied", like it's secret or something.

This link has much discussion from the usenet group linux.hardware.dell.poweredge:

 

http://www.digipedia.pl/usenet/thread/18711/41667/

 

Would someone like to explain, please, what is a C state and why it's causing RHEL 6.1 so much grief?  Thanks in advance!

I see now that I omitted the processor details.  We see this problem on machines with Intel Xeon E5540 processors.  I now know that C states are are idle (power-saving) states.  I'm guessing that RHEL is struggling with managing the processor's idle states.  These blog posts at Intel explain C states:

 

http://software.intel.com/en-us/blogs/2008/03/27/update-c-states-c-states-and-even-more-c-states/

 

http://software.intel.com/en-us/blogs/2008/04/29/theres-got-to-be-a-catch/

Well, this works on RHEL 6. Two weeks ago I changed /etc/grub.conf to add the following arguments at the end of the "kernel" line.  We have not had a slowdown problem since that time.

 

    intel_idle.max_cstate=0 processor.max_cstate=1