[RHEL5.5] kblockd high cpu - required reboot

Solution Unverified - Updated -

Environment

  • Red Hat Enterprise Linux 5.5
    • kernel 2.6.18-194.17.4.el5
  • Large IO subsystem (/proc/partitions almost 7000 entries)
  • EMC Powerpath
EMCpower.LINUX-5.3.1.00.00-111.x86_64
  • Veritas VCS, Veritas DMP, etc
VRTSaa-5.0.602.0-0.i686                                     Mon 18 May 2009 10:11:41 AM CDT
VRTSacclib-5.0.30.00-MP3_GENERIC.noarch                     Mon 18 May 2009 10:18:36 AM CDT
VRTSalloc-5.0.30.00-MP3_RHEL5.i686                          Mon 18 May 2009 10:14:42 AM CDT
VRTSatClient-4.3.34.4-4.i386                                Mon 18 May 2009 10:09:08 AM CDT
VRTSatServer-4.3.34.4-4.i386                                Mon 18 May 2009 10:09:45 AM CDT
VRTScavf-5.0.30.20-MP3RP2a_GENERIC.noarch                   Mon 29 Jun 2009 09:16:02 AM CDT
VRTSccg-5.0.602.0-0.i686                                    Mon 18 May 2009 10:11:14 AM CDT
VRTScmccc-5.0.30.00-MP3_RHEL5.i686                          Mon 18 May 2009 10:20:40 AM CDT
VRTScmcs-5.0.30.00-MP3_RHEL5.i686                           Mon 18 May 2009 10:20:27 AM CDT
VRTScscm-5.0.30.00-MP3_GENERIC.noarch                       Mon 18 May 2009 10:19:47 AM CDT
VRTScscw-5.0.30.00-MP3_GENERIC.noarch                       Mon 18 May 2009 10:19:34 AM CDT
VRTScssim-5.0.30.00-MP3_RHEL5.i686                          Mon 18 May 2009 10:20:01 AM CDT
VRTScutil-5.0-MP3_GENERIC.noarch                            Mon 18 May 2009 10:20:14 AM CDT
VRTSdbcom-common-5.0.30.20-MP3RP2_RHEL5.i686                Mon 29 Jun 2009 09:14:12 AM CDT
VRTSdbed-common-5.0.30.20-MP3RP2_RHEL5.i686                 Mon 29 Jun 2009 09:14:06 AM CDT
VRTSdbms3-3.0.119.0-3.0.119.0.x86_64                        Mon 18 May 2009 10:17:11 AM CDT
VRTSdcli-5.0.30.20-GA_MP3RP2_RHEL.i686                      Mon 29 Jun 2009 09:15:48 AM CDT
VRTSddlpr-5.0.30.20-GA_MP3_RHEL5.i686                       Mon 29 Jun 2009 09:15:47 AM CDT
VRTSdsa-5.0.00.00-GA_RHEL4.i686                             Mon 18 May 2009 10:13:14 AM CDT
VRTSfsman-5.0.30.00-MP3_GENERIC.noarch                      Mon 18 May 2009 10:16:31 AM CDT
VRTSfsmnd-5.0.30.00-MP3_GENERIC.noarch                      Mon 18 May 2009 10:16:57 AM CDT
VRTSfspro-5.0.30.00-MP3_RHEL5.i686                          Mon 18 May 2009 10:13:27 AM CDT
VRTSfssdk-5.0.30.30-MP3RP3_RHEL5.i686                       Sat 19 Sep 2009 06:42:45 PM CDT
VRTSgab-5.0.30.20-MP3RP2_RHEL5.x86_64                       Mon 29 Jun 2009 09:15:45 AM CDT
VRTSicsco-1.3.28.0-0.i386                                   Mon 18 May 2009 10:08:42 AM CDT
VRTSjre15-1.5.3.5-5.i386                                    Mon 18 May 2009 10:15:52 AM CDT
VRTSllt-5.0.30.20-MP3RP2_RHEL5.x86_64                       Mon 29 Jun 2009 09:15:45 AM CDT
VRTSlvmconv-5.0.30.30-MP3RP3_RHEL5.i686                     Sat 19 Sep 2009 06:42:54 PM CDT
VRTSmh-5.0.527.0-0.i686                                     Mon 18 May 2009 10:11:28 AM CDT
VRTSob-3.3.1202.0-0.i686                                    Mon 18 May 2009 10:10:43 AM CDT
VRTSobc33-3.3.1202.0-0.i686                                 Mon 18 May 2009 10:10:29 AM CDT
VRTSobgui-3.3.1202.0-0.i686                                 Mon 18 May 2009 10:11:00 AM CDT
VRTSodm-common-5.0.30.00-MP3_RHEL5.i686                     Mon 18 May 2009 10:22:04 AM CDT
VRTSodm-platform-5.0.30.30-MP3RP3_RHEL5.x86_64              Sat 19 Sep 2009 06:42:54 PM CDT
VRTSorgui-common-5.0.30.20-MP3RP2_RHEL5.i686                Mon 29 Jun 2009 09:14:08 AM CDT
VRTSpbx-1.3.28.0-0.i386                                     Mon 18 May 2009 10:08:54 AM CDT
VRTSperl-5.8.8.0-RHEL5.x86_64                               Mon 18 May 2009 10:08:27 AM CDT
VRTSspt-5.0.00.4-GA.noarch                                  Mon 18 May 2009 10:11:55 AM CDT
VRTSvcs-5.0.30.20-MP3RP2_RHEL5.i686                         Mon 29 Jun 2009 09:16:01 AM CDT
VRTSvcsag-5.0.30.20-MP3RP2_RHEL5.i686                       Mon 29 Jun 2009 09:16:02 AM CDT
VRTSvcsdr-5.0.30.00-MP3_RHEL5.x86_64                        Mon 18 May 2009 10:19:07 AM CDT
VRTSvcsmg-5.0.30.00-MP3_GENERIC.noarch                      Mon 18 May 2009 10:18:23 AM CDT
VRTSvcsmn-5.0.30.00-MP3_GENERIC.noarch                      Mon 18 May 2009 10:19:20 AM CDT
VRTSvcsor-5.0.30.00-MP3_RHEL5.i686                          Mon 18 May 2009 10:21:51 AM CDT
VRTSvcsvr-5.0.30.00-MP3_GENERIC.noarch                      Mon 18 May 2009 10:15:37 AM CDT
VRTSvdid-1.2.206.0-206.i386                                 Mon 18 May 2009 10:14:56 AM CDT
VRTSvlic-3.02.33.5500-0.x86_64                              Mon 18 May 2009 10:05:58 AM CDT
VRTSvmman-5.0.30.30-MP3RP3_GENERIC.noarch                   Sat 19 Sep 2009 06:42:54 PM CDT
VRTSvmpro-5.0.30.20-GA_RHEL5.i686                           Mon 29 Jun 2009 09:16:08 AM CDT
VRTSvrpro-5.0.30.00-MP3_RHEL5.i686                          Mon 18 May 2009 10:15:23 AM CDT
VRTSvrw-5.0.30.00-MP3_RHEL5.i686                            Mon 18 May 2009 10:16:18 AM CDT
VRTSvxfen-5.0.30.20-MP3RP2_RHEL5.x86_64                     Mon 29 Jun 2009 09:15:45 AM CDT
VRTSvxfs-common-5.0.30.30-MP3RP3_RHEL5.i686                 Sat 19 Sep 2009 06:42:43 PM CDT
VRTSvxfs-platform-5.0.30.30-MP3RP3_RHEL5.x86_64             Sat 19 Sep 2009 06:42:45 PM CDT
VRTSvxmsa-4.4.028-RHEL4.i686                                Mon 18 May 2009 10:21:38 AM CDT
VRTSvxvm-common-5.0.30.30-MP3RP3_RHEL5.i686                 Sat 19 Sep 2009 06:42:47 PM CDT
VRTSvxvm-platform-5.0.30.30-MP3RP3_RHEL5.x86_64             Sat 19 Sep 2009 06:42:55 PM CDT
VRTSweb-5.0.1-GA4_GENERIC.noarch                            Mon 18 May 2009 10:16:05 AM CDT

Issue

  • High system load with kblockd threads seen in the top 20 processes in 'top' output, required system reboot.
  • Approximate timeline of the problem
3:02 problems started
4:15 run queue went from 5 to 60
had to manually reboot the box at 6:04
  • Our sar command shows us high %system and 0 %idle, between 04:20 and 04:30 CST
# sar
Linux 2.6.18-194.17.4.el5 (linuxsystem)   09/16/2012

12:00:02 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:01 AM       all      3.17      0.00      3.45      1.01      0.00     92.37
12:20:01 AM       all      2.08      0.00      2.85      0.75      0.00     94.32
12:30:02 AM       all      2.42      0.00      2.75      0.34      0.00     94.49
...
03:50:01 AM       all      4.26      0.00      8.31      2.46      0.00     84.97
04:00:02 AM       all      3.42      0.00      8.01      0.70      0.00     87.87
04:10:01 AM       all      4.17      0.07      9.64      2.36      0.00     83.75
04:20:02 AM       all      3.17      0.00     32.52      1.39      0.00     62.93
04:30:02 AM       all      2.50      0.01     97.39      0.00      0.00      0.10
04:40:03 AM       all      2.77      0.00     97.22      0.00      0.00      0.01
Average:          all      3.73      0.00     12.39      2.39      0.00     81.49

06:04:06 AM       LINUX RESTART
  • top output shows system was up for 230 days, load average over 40, and kblockd threads consuming almost 100% of CPU
top - 16:57:34 up 230 days, 20:10,  1 user,  load average: 43.95, 47.76, 49.36
Tasks: 1144 total,  50 running, 1067 sleeping,   0 stopped,  27 zombie
Cpu(s): 35.8%us,  5.7%sy,  0.0%ni, 56.5%id,  1.7%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:  65875208k total, 65371948k used,   503260k free,   166584k buffers
Swap: 33554424k total,  2594600k used, 30959824k free, 56725016k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
  645 root      12  -5     0    0    0 R 97.5  0.0   4981:18 kblockd/1          
  646 root      12  -5     0    0    0 R 97.5  0.0   4195:12 kblockd/2          
  650 root      15  -5     0    0    0 R 95.7  0.0   6156:41 kblockd/6          
  647 root      14  -5     0    0    0 R 93.9  0.0   4097:12 kblockd/3          
  662 root      14  -5     0    0    0 R 93.4  0.0   2691:44 kblockd/18         
  658 root      15  -5     0    0    0 R 92.5  0.0   2618:32 kblockd/14         
  663 root      14  -5     0    0    0 R 90.7  0.0   2547:15 kblockd/19         
  667 root      14  -5     0    0    0 R 89.3  0.0   2637:37 kblockd/23         
  661 root      14  -5     0    0    0 R 87.5  0.0   2972:21 kblockd/17         
  654 root      14  -5     0    0    0 R 87.1  0.0   2578:49 kblockd/10         
  660 root      13  -5     0    0    0 R 85.3  0.0   3074:13 kblockd/16         
  666 root      13  -5     0    0    0 R 79.9  0.0   2589:33 kblockd/22         
  655 root      11  -5     0    0    0 R 75.4  0.0   2445:31 kblockd/11         
  665 root      12  -5     0    0    0 R 74.5  0.0   2022:35 kblockd/21         
  664 root      10  -5     0    0    0 R 65.4  0.0   1321:50 kblockd/20         
  656 root      11  -5     0    0    0 R 63.6  0.0   1067:28 kblockd/12         
  657 root      11  -5     0    0    0 R 63.6  0.0   1752:55 kblockd/13         
  651 root      10  -5     0    0    0 R 60.5  0.0   5321:24 kblockd/7          

Root Cause

Diagnostic Steps

  • Look at the messages file around the time of the high CPU
Sep 16 02:35:56 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Finished checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 02:35
Sep 16 02:55:56 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Starting checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 02:55
Sep 16 02:55:56 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Finished checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 02:55
Sep 16 03:01:56 linuxhost auditd[12094]: Audit daemon rotating log files
Sep 16 03:04:32 linuxhost kernel: program nmhs is using a deprecated SCSI ioctl, please convert it to SG_IO
Sep 16 03:04:33 linuxhost last message repeated 543 times
Sep 16 03:11:17 linuxhost sshd[1938]: Accepted publickey for sysstat from 10.200.130.34 port 38162 ssh2
Sep 16 03:11:17 linuxhost sshd[1938]: subsystem request for sftp
Sep 16 03:15:57 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Starting checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 03:15
Sep 16 03:15:57 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Finished checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 03:15
Sep 16 03:27:33 linuxhost ntpd[16098]: synchronized to 10.10.130.4, stratum 2
Sep 16 03:31:14 linuxhost sshd[8098]: Accepted publickey for sysstat from 10.200.130.34 port 43634 ssh2
Sep 16 03:31:14 linuxhost sshd[8098]: subsystem request for sftp
Sep 16 03:35:58 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Starting checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 03:35
Sep 16 03:35:58 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Finished checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 03:35
Sep 16 03:55:59 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Starting checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 03:55
Sep 16 03:55:59 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Finished checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 03:55
Sep 16 04:00:04 linuxhost ntpd[16098]: synchronized to 10.10.3.4, stratum 2
Sep 16 04:02:02 linuxhost syslogd 1.4.1: restart.
Sep 16 04:02:02 linuxhost logrotate: ALERT exited abnormally with [1]
Sep 16 04:04:39 linuxhost kernel: program nmhs is using a deprecated SCSI ioctl, please convert it to SG_IO
Sep 16 04:04:40 linuxhost last message repeated 543 times
Sep 16 04:15:59 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Starting checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 04:15
Sep 16 04:15:59 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Finished checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 04:15
Sep 16 04:18:09 linuxhost AgentFramework[20900]: VCS ERROR V-16-1-13027 Thread(4131388304) Resource(xyz_abc_dg) - monitor procedure did not complete within the expected time.
Sep 16 04:18:09 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(xyz_abc_dg) - monitor procedure did not complete within the expected time.
Sep 16 04:18:20 linuxhost sshd[23727]: Accepted publickey for oracle from 1.2.3.4 port 33301 ssh2
Sep 16 04:18:22 linuxhost Had[19805]: VCS CRITICAL V-16-1-50086 CPU usage on linuxhost is 99%
Sep 16 04:18:38 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4145650576) Resource(eapr_apuanfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:38 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(eapr_apuanfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:52 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4061731728) Resource(eao_arlanfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:52 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4082711440) Resource(eaf_temp1p1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:52 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4093201296) Resource(eas_asklefp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:52 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(eaf_temp1p1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:52 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(eao_arlanfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:52 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(eas_asklefp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:53 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4103691152) Resource(elm_lmlanfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:53 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(elm_lmlanfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:53 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4001364880) Resource(eaf_fsbdefp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:53 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(eaf_fsbdefp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:54 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4016036752) Resource(eaf_fsacnfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:54 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(eaf_fsacnfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:56 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4114181008) Resource(eaf_fslanfp1_db) - monitor procedure did not complete within the expected time.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.