Issue Summary - CPU spike during DB2 database backup
Issue
The DB2 database server becomes unresponsive.
The CPU run queue reported by vmstat jumps from the normal handful of waiting processes into the hundreds or thousands.
CPU utilization eventually spikes as well, though not necessarily right away.
During these events, strace output shows more processes than usual spinning in busy-wait loops and repeatedly yielding the CPU, which puts them straight back on the run queue:
18217 15:34:20.497739 select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
18217 15:34:20.507766 <... select resumed> ) = 0 (Timeout)
18217 15:34:20.521358 sched_yield( <unfinished ...>
18217 15:34:20.531511 <... sched_yield resumed> ) = 0
18217 15:34:20.541386 sched_yield( <unfinished ...>
18217 15:34:20.551405 <... sched_yield resumed> ) = 0
18217 15:34:20.561107 sched_yield( <unfinished ...>
18217 15:34:20.571119 <... sched_yield resumed> ) = 0
18217 15:34:20.581060 sched_yield( <unfinished ...>
18217 15:34:20.590966 <... sched_yield resumed> ) = 0
18217 15:34:20.601051 sched_yield( <unfinished ...>
18217 15:34:20.610843 <... sched_yield resumed> ) = 0
18217 15:34:20.620418 select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
18217 15:34:20.630127 <... select resumed> ) = 0 (Timeout)
18217 15:34:20.640086 sched_yield( <unfinished ...>
18217 15:34:20.658969 <... sched_yield resumed> ) = 0
18217 15:34:20.668837 sched_yield( <unfinished ...>
18217 15:34:20.678556 <... sched_yield resumed> ) = 0
18217 15:34:20.688694 sched_yield( <unfinished ...>
18217 15:34:20.698502 <... sched_yield resumed> ) = 0
18217 15:34:20.708221 sched_yield( <unfinished ...>
18217 15:34:20.718445 <... sched_yield resumed> ) = 0
18217 15:34:20.728176 sched_yield( <unfinished ...>
18217 15:34:20.738251 <... sched_yield resumed> ) = 0
18217 15:34:20.748080 select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
18217 15:34:20.759658 <... select resumed> ) = 0 (Timeout)
18217 15:34:20.769284 sched_yield( <unfinished ...>
18217 15:34:20.788357 <... sched_yield resumed> ) = 0
18217 15:34:20.798316 sched_yield( <unfinished ...>
18217 15:34:20.807992 <... sched_yield resumed> ) = 0
18217 15:34:20.817618 sched_yield( <unfinished ...>
18217 15:34:20.827431 <... sched_yield resumed> ) = 0
18217 15:34:20.837477 sched_yield( <unfinished ...>
18217 15:34:20.851028 <... sched_yield resumed> ) = 0
18217 15:34:20.860788 sched_yield( <unfinished ...>
18217 15:34:20.870275 <... sched_yield resumed> ) = 0
18217 15:34:20.880015 select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
18217 15:34:20.893136 <... select resumed> ) = 0 (Timeout)
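To compare a trace like the one above against a trace taken during normal operation, it can help to reduce each one to per-syscall counts and elapsed times. The following is a minimal sketch, assuming the trace is in the same layout as the excerpt (PID, wall-clock timestamp, then the call) and that each `<unfinished ...>` line is later matched by a `resumed` line for the same PID and syscall. The function name `summarize_strace` is illustrative and not part of any tool.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches e.g. "18217 15:34:20.521358 sched_yield( <unfinished ...>"
UNFINISHED = re.compile(
    r"^(\d+)\s+(\d{2}:\d{2}:\d{2}\.\d+)\s+(\w+)\(.*<unfinished \.\.\.>\s*$"
)
# Matches e.g. "18217 15:34:20.531511 <... sched_yield resumed> ) = 0"
RESUMED = re.compile(
    r"^(\d+)\s+(\d{2}:\d{2}:\d{2}\.\d+)\s+<\.\.\. (\w+) resumed>"
)


def _to_microseconds(timestamp):
    """Convert an HH:MM:SS.ffffff timestamp to microseconds since midnight."""
    t = datetime.strptime(timestamp, "%H:%M:%S.%f")
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond


def summarize_strace(lines):
    """Return {(pid, syscall): {"calls": n, "total_us": elapsed}}.

    Only calls split into an unfinished/resumed pair are counted, which is
    how every call appears in the excerpt. Traces that span midnight are not
    handled (an assumption made to keep the sketch short).
    """
    pending = {}  # (pid, syscall) -> start time in microseconds
    stats = defaultdict(lambda: {"calls": 0, "total_us": 0})
    for line in lines:
        line = line.strip()
        started = UNFINISHED.match(line)
        if started:
            pid, timestamp, syscall = started.groups()
            pending[(pid, syscall)] = _to_microseconds(timestamp)
            continue
        resumed = RESUMED.match(line)
        if resumed:
            pid, timestamp, syscall = resumed.groups()
            start = pending.pop((pid, syscall), None)
            if start is not None:
                entry = stats[(pid, syscall)]
                entry["calls"] += 1
                entry["total_us"] += _to_microseconds(timestamp) - start
    return dict(stats)
```

Applied to the excerpt, a summary of this kind makes the pattern easy to see: each `select` with a 1000-microsecond timeout is followed by five `sched_yield` calls, and every call takes roughly 10 ms of wall-clock time to return.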
Current Action Plan
This looks very much like an application issue.
The issue can correct itself and go away without intervention. The customer's current off-hours effort is to capture an strace while the issue is clearing and the server is returning to normal.
The hope is to find out why these processes are spinning in a busy-wait loop and what they do afterwards that gets them out of it.
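Catching the moment the server starts to normalize means knowing when the run queue drops back down. The sketch below is one way to flag that transition from vmstat output. It assumes the standard vmstat layout in which the first column (`r`) is the run queue, and the threshold of 100 is an illustrative value chosen only because the run queue is described as reaching the hundreds or thousands. The function name `run_queue_events` is hypothetical.

```python
def run_queue_events(vmstat_lines, threshold=100):
    """Yield (sample_index, state, run_queue) each time the state flips.

    state is "spike" when the run queue is at or above the threshold
    (an assumed value, tune it for the system) and "normal" otherwise.
    Header lines are skipped because their first field is not a number.
    """
    state = "normal"
    index = 0
    for line in vmstat_lines:
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # header or blank line
        run_queue = int(fields[0])
        new_state = "spike" if run_queue >= threshold else "normal"
        if new_state != state:
            yield (index, new_state, run_queue)
            state = new_state
        index += 1
```

Fed line by line from a running `vmstat 1`, a "spike" followed by a "normal" event marks the window in which the strace described above would need to be running.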
Environment
- RHEL 5.7 (2.6.18-274.el5)
- DB2 UDB
- DB2 filesystems are on EMC SAN storage, with a Veritas VxFS filesystem. (... using 1k blocksize)
- HP bl460g6 2-socket 4-core hyperthreaded server - Version: Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
