[RHEL5.5] kblockd high cpu - required reboot
Environment
- Red Hat Enterprise Linux 5.5
- kernel 2.6.18-194.17.4.el5
- Large IO subsystem (/proc/partitions almost 7000 entries)
- EMC Powerpath
EMCpower.LINUX-5.3.1.00.00-111.x86_64
- Veritas VCS, Veritas DMP, etc
VRTSaa-5.0.602.0-0.i686 Mon 18 May 2009 10:11:41 AM CDT
VRTSacclib-5.0.30.00-MP3_GENERIC.noarch Mon 18 May 2009 10:18:36 AM CDT
VRTSalloc-5.0.30.00-MP3_RHEL5.i686 Mon 18 May 2009 10:14:42 AM CDT
VRTSatClient-4.3.34.4-4.i386 Mon 18 May 2009 10:09:08 AM CDT
VRTSatServer-4.3.34.4-4.i386 Mon 18 May 2009 10:09:45 AM CDT
VRTScavf-5.0.30.20-MP3RP2a_GENERIC.noarch Mon 29 Jun 2009 09:16:02 AM CDT
VRTSccg-5.0.602.0-0.i686 Mon 18 May 2009 10:11:14 AM CDT
VRTScmccc-5.0.30.00-MP3_RHEL5.i686 Mon 18 May 2009 10:20:40 AM CDT
VRTScmcs-5.0.30.00-MP3_RHEL5.i686 Mon 18 May 2009 10:20:27 AM CDT
VRTScscm-5.0.30.00-MP3_GENERIC.noarch Mon 18 May 2009 10:19:47 AM CDT
VRTScscw-5.0.30.00-MP3_GENERIC.noarch Mon 18 May 2009 10:19:34 AM CDT
VRTScssim-5.0.30.00-MP3_RHEL5.i686 Mon 18 May 2009 10:20:01 AM CDT
VRTScutil-5.0-MP3_GENERIC.noarch Mon 18 May 2009 10:20:14 AM CDT
VRTSdbcom-common-5.0.30.20-MP3RP2_RHEL5.i686 Mon 29 Jun 2009 09:14:12 AM CDT
VRTSdbed-common-5.0.30.20-MP3RP2_RHEL5.i686 Mon 29 Jun 2009 09:14:06 AM CDT
VRTSdbms3-3.0.119.0-3.0.119.0.x86_64 Mon 18 May 2009 10:17:11 AM CDT
VRTSdcli-5.0.30.20-GA_MP3RP2_RHEL.i686 Mon 29 Jun 2009 09:15:48 AM CDT
VRTSddlpr-5.0.30.20-GA_MP3_RHEL5.i686 Mon 29 Jun 2009 09:15:47 AM CDT
VRTSdsa-5.0.00.00-GA_RHEL4.i686 Mon 18 May 2009 10:13:14 AM CDT
VRTSfsman-5.0.30.00-MP3_GENERIC.noarch Mon 18 May 2009 10:16:31 AM CDT
VRTSfsmnd-5.0.30.00-MP3_GENERIC.noarch Mon 18 May 2009 10:16:57 AM CDT
VRTSfspro-5.0.30.00-MP3_RHEL5.i686 Mon 18 May 2009 10:13:27 AM CDT
VRTSfssdk-5.0.30.30-MP3RP3_RHEL5.i686 Sat 19 Sep 2009 06:42:45 PM CDT
VRTSgab-5.0.30.20-MP3RP2_RHEL5.x86_64 Mon 29 Jun 2009 09:15:45 AM CDT
VRTSicsco-1.3.28.0-0.i386 Mon 18 May 2009 10:08:42 AM CDT
VRTSjre15-1.5.3.5-5.i386 Mon 18 May 2009 10:15:52 AM CDT
VRTSllt-5.0.30.20-MP3RP2_RHEL5.x86_64 Mon 29 Jun 2009 09:15:45 AM CDT
VRTSlvmconv-5.0.30.30-MP3RP3_RHEL5.i686 Sat 19 Sep 2009 06:42:54 PM CDT
VRTSmh-5.0.527.0-0.i686 Mon 18 May 2009 10:11:28 AM CDT
VRTSob-3.3.1202.0-0.i686 Mon 18 May 2009 10:10:43 AM CDT
VRTSobc33-3.3.1202.0-0.i686 Mon 18 May 2009 10:10:29 AM CDT
VRTSobgui-3.3.1202.0-0.i686 Mon 18 May 2009 10:11:00 AM CDT
VRTSodm-common-5.0.30.00-MP3_RHEL5.i686 Mon 18 May 2009 10:22:04 AM CDT
VRTSodm-platform-5.0.30.30-MP3RP3_RHEL5.x86_64 Sat 19 Sep 2009 06:42:54 PM CDT
VRTSorgui-common-5.0.30.20-MP3RP2_RHEL5.i686 Mon 29 Jun 2009 09:14:08 AM CDT
VRTSpbx-1.3.28.0-0.i386 Mon 18 May 2009 10:08:54 AM CDT
VRTSperl-5.8.8.0-RHEL5.x86_64 Mon 18 May 2009 10:08:27 AM CDT
VRTSspt-5.0.00.4-GA.noarch Mon 18 May 2009 10:11:55 AM CDT
VRTSvcs-5.0.30.20-MP3RP2_RHEL5.i686 Mon 29 Jun 2009 09:16:01 AM CDT
VRTSvcsag-5.0.30.20-MP3RP2_RHEL5.i686 Mon 29 Jun 2009 09:16:02 AM CDT
VRTSvcsdr-5.0.30.00-MP3_RHEL5.x86_64 Mon 18 May 2009 10:19:07 AM CDT
VRTSvcsmg-5.0.30.00-MP3_GENERIC.noarch Mon 18 May 2009 10:18:23 AM CDT
VRTSvcsmn-5.0.30.00-MP3_GENERIC.noarch Mon 18 May 2009 10:19:20 AM CDT
VRTSvcsor-5.0.30.00-MP3_RHEL5.i686 Mon 18 May 2009 10:21:51 AM CDT
VRTSvcsvr-5.0.30.00-MP3_GENERIC.noarch Mon 18 May 2009 10:15:37 AM CDT
VRTSvdid-1.2.206.0-206.i386 Mon 18 May 2009 10:14:56 AM CDT
VRTSvlic-3.02.33.5500-0.x86_64 Mon 18 May 2009 10:05:58 AM CDT
VRTSvmman-5.0.30.30-MP3RP3_GENERIC.noarch Sat 19 Sep 2009 06:42:54 PM CDT
VRTSvmpro-5.0.30.20-GA_RHEL5.i686 Mon 29 Jun 2009 09:16:08 AM CDT
VRTSvrpro-5.0.30.00-MP3_RHEL5.i686 Mon 18 May 2009 10:15:23 AM CDT
VRTSvrw-5.0.30.00-MP3_RHEL5.i686 Mon 18 May 2009 10:16:18 AM CDT
VRTSvxfen-5.0.30.20-MP3RP2_RHEL5.x86_64 Mon 29 Jun 2009 09:15:45 AM CDT
VRTSvxfs-common-5.0.30.30-MP3RP3_RHEL5.i686 Sat 19 Sep 2009 06:42:43 PM CDT
VRTSvxfs-platform-5.0.30.30-MP3RP3_RHEL5.x86_64 Sat 19 Sep 2009 06:42:45 PM CDT
VRTSvxmsa-4.4.028-RHEL4.i686 Mon 18 May 2009 10:21:38 AM CDT
VRTSvxvm-common-5.0.30.30-MP3RP3_RHEL5.i686 Sat 19 Sep 2009 06:42:47 PM CDT
VRTSvxvm-platform-5.0.30.30-MP3RP3_RHEL5.x86_64 Sat 19 Sep 2009 06:42:55 PM CDT
VRTSweb-5.0.1-GA4_GENERIC.noarch Mon 18 May 2009 10:16:05 AM CDT
Issue
- High system load with kblockd threads seen in the top 20 processes in 'top' output, required system reboot.
- Approximate timeline of the problem
3:02 problems started
4:15 run queue went from 5 to 60
had to manually reboot the box at 6:04
- Our sar command shows us high %system and 0 %idle, between 04:20 and 04:30 CST
# sar
Linux 2.6.18-194.17.4.el5 (linuxsystem) 09/16/2012
12:00:02 AM CPU %user %nice %system %iowait %steal %idle
12:10:01 AM all 3.17 0.00 3.45 1.01 0.00 92.37
12:20:01 AM all 2.08 0.00 2.85 0.75 0.00 94.32
12:30:02 AM all 2.42 0.00 2.75 0.34 0.00 94.49
...
03:50:01 AM all 4.26 0.00 8.31 2.46 0.00 84.97
04:00:02 AM all 3.42 0.00 8.01 0.70 0.00 87.87
04:10:01 AM all 4.17 0.07 9.64 2.36 0.00 83.75
04:20:02 AM all 3.17 0.00 32.52 1.39 0.00 62.93
04:30:02 AM all 2.50 0.01 97.39 0.00 0.00 0.10
04:40:03 AM all 2.77 0.00 97.22 0.00 0.00 0.01
Average: all 3.73 0.00 12.39 2.39 0.00 81.49
06:04:06 AM LINUX RESTART
- top output shows system was up for 230 days, load average over 40, and kblockd threads consuming almost 100% of CPU
top - 16:57:34 up 230 days, 20:10, 1 user, load average: 43.95, 47.76, 49.36
Tasks: 1144 total, 50 running, 1067 sleeping, 0 stopped, 27 zombie
Cpu(s): 35.8%us, 5.7%sy, 0.0%ni, 56.5%id, 1.7%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 65875208k total, 65371948k used, 503260k free, 166584k buffers
Swap: 33554424k total, 2594600k used, 30959824k free, 56725016k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
645 root 12 -5 0 0 0 R 97.5 0.0 4981:18 kblockd/1
646 root 12 -5 0 0 0 R 97.5 0.0 4195:12 kblockd/2
650 root 15 -5 0 0 0 R 95.7 0.0 6156:41 kblockd/6
647 root 14 -5 0 0 0 R 93.9 0.0 4097:12 kblockd/3
662 root 14 -5 0 0 0 R 93.4 0.0 2691:44 kblockd/18
658 root 15 -5 0 0 0 R 92.5 0.0 2618:32 kblockd/14
663 root 14 -5 0 0 0 R 90.7 0.0 2547:15 kblockd/19
667 root 14 -5 0 0 0 R 89.3 0.0 2637:37 kblockd/23
661 root 14 -5 0 0 0 R 87.5 0.0 2972:21 kblockd/17
654 root 14 -5 0 0 0 R 87.1 0.0 2578:49 kblockd/10
660 root 13 -5 0 0 0 R 85.3 0.0 3074:13 kblockd/16
666 root 13 -5 0 0 0 R 79.9 0.0 2589:33 kblockd/22
655 root 11 -5 0 0 0 R 75.4 0.0 2445:31 kblockd/11
665 root 12 -5 0 0 0 R 74.5 0.0 2022:35 kblockd/21
664 root 10 -5 0 0 0 R 65.4 0.0 1321:50 kblockd/20
656 root 11 -5 0 0 0 R 63.6 0.0 1067:28 kblockd/12
657 root 11 -5 0 0 0 R 63.6 0.0 1752:55 kblockd/13
651 root 10 -5 0 0 0 R 60.5 0.0 5321:24 kblockd/7
Root Cause
- Unknown.
- Could be another instance of Frequent soft lockups in kblockd with EMC Powerpath.
Diagnostic Steps
- Look at the messages file around the time of the high CPU
Sep 16 02:35:56 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Finished checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 02:35
Sep 16 02:55:56 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Starting checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 02:55
Sep 16 02:55:56 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Finished checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 02:55
Sep 16 03:01:56 linuxhost auditd[12094]: Audit daemon rotating log files
Sep 16 03:04:32 linuxhost kernel: program nmhs is using a deprecated SCSI ioctl, please convert it to SG_IO
Sep 16 03:04:33 linuxhost last message repeated 543 times
Sep 16 03:11:17 linuxhost sshd[1938]: Accepted publickey for sysstat from 10.200.130.34 port 38162 ssh2
Sep 16 03:11:17 linuxhost sshd[1938]: subsystem request for sftp
Sep 16 03:15:57 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Starting checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 03:15
Sep 16 03:15:57 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Finished checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 03:15
Sep 16 03:27:33 linuxhost ntpd[16098]: synchronized to 10.10.130.4, stratum 2
Sep 16 03:31:14 linuxhost sshd[8098]: Accepted publickey for sysstat from 10.200.130.34 port 43634 ssh2
Sep 16 03:31:14 linuxhost sshd[8098]: subsystem request for sftp
Sep 16 03:35:58 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Starting checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 03:35
Sep 16 03:35:58 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Finished checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 03:35
Sep 16 03:55:59 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Starting checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 03:55
Sep 16 03:55:59 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Finished checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 03:55
Sep 16 04:00:04 linuxhost ntpd[16098]: synchronized to 10.10.3.4, stratum 2
Sep 16 04:02:02 linuxhost syslogd 1.4.1: restart.
Sep 16 04:02:02 linuxhost logrotate: ALERT exited abnormally with [1]
Sep 16 04:04:39 linuxhost kernel: program nmhs is using a deprecated SCSI ioctl, please convert it to SG_IO
Sep 16 04:04:40 linuxhost last message repeated 543 times
Sep 16 04:15:59 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Starting checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 04:15
Sep 16 04:15:59 linuxhost SQLAnywhere(veritas_mydb_linuxhost): Finished checkpoint of "vxdbms" (vxdbms.db) at Sun Sep 16 2012 04:15
Sep 16 04:18:09 linuxhost AgentFramework[20900]: VCS ERROR V-16-1-13027 Thread(4131388304) Resource(xyz_abc_dg) - monitor procedure did not complete within the expected time.
Sep 16 04:18:09 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(xyz_abc_dg) - monitor procedure did not complete within the expected time.
Sep 16 04:18:20 linuxhost sshd[23727]: Accepted publickey for oracle from 1.2.3.4 port 33301 ssh2
Sep 16 04:18:22 linuxhost Had[19805]: VCS CRITICAL V-16-1-50086 CPU usage on linuxhost is 99%
Sep 16 04:18:38 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4145650576) Resource(eapr_apuanfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:38 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(eapr_apuanfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:52 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4061731728) Resource(eao_arlanfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:52 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4082711440) Resource(eaf_temp1p1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:52 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4093201296) Resource(eas_asklefp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:52 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(eaf_temp1p1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:52 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(eao_arlanfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:52 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(eas_asklefp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:53 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4103691152) Resource(elm_lmlanfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:53 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(elm_lmlanfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:53 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4001364880) Resource(eaf_fsbdefp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:53 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(eaf_fsbdefp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:54 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4016036752) Resource(eaf_fsacnfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:54 linuxhost Had[19805]: VCS ERROR V-16-1-13027 (linuxhost) Resource(eaf_fsacnfp1_db) - monitor procedure did not complete within the expected time.
Sep 16 04:18:56 linuxhost AgentFramework[15310]: VCS ERROR V-16-1-13027 Thread(4114181008) Resource(eaf_fslanfp1_db) - monitor procedure did not complete within the expected time.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
Comments