Server went into a hung state

Environment

  • Red Hat Enterprise Linux 5.6

Issue

  • Server goes into a hung state and it is not possible to open a PuTTY (SSH) session.

Resolution

Check the CPU and I/O load on the host (hypervisor) system; the slow I/O seen by the guest points to a busy or overcommitted host rather than a problem inside the guest.
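
A minimal sketch of how the host's load and disk activity might be checked, assuming the host (or its service console) is a Linux system; on a VMware ESX/ESXi host, esxtop provides equivalent CPU and disk views:

    # load average, memory and swap activity
    uptime
    vmstat 1 5

    # per-device I/O latency and utilization (sysstat package)
    iostat -x 1 5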

Root Cause

What happened here is that a process - in this case a filesystem journal thread (kjournald) - was delayed for more than 2 minutes while it waited for I/O to complete, which caused the hung task watchdog to trigger. Since the hung_task_panic setting was enabled, the system panicked and dumped core. The process was delayed because the device /dev/sda had a full queue and was very slow to complete I/O requests. There is also evidence that the guest is running low on available memory and is swapping to disk. The slow I/O is unlikely to be caused by the guest itself; rather, a busy host system is starving the guest of I/O throughput.
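
For reference, the watchdog behaviour described above is controlled by sysctl tunables. A minimal sketch of the relevant settings (the values shown are illustrative - verify against the running system before changing anything):

    # /etc/sysctl.conf
    kernel.hung_task_timeout_secs = 120   # seconds a task may remain uninterruptible before the watchdog fires
    kernel.hung_task_panic = 1            # 1 = panic when it fires (as on this system), 0 = only log a warning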

Diagnostic Steps

Looking at the vmcore:

      KERNEL: /usr/lib/debug/lib/modules/2.6.18-238.el5PAE/vmlinux
    DUMPFILE: vmcore
        CPUS: 4
        DATE: Mon Feb  3 20:05:32 2014
      UPTIME: 2 days, 18:13:05
LOAD AVERAGE: 6.74, 5.23, 3.42
       TASKS: 193
    NODENAME: <hostname>
     RELEASE: 2.6.18-238.el5PAE
     VERSION: #1 SMP Sun Dec 19 14:42:44 EST 2010
     MACHINE: i686  (2666 Mhz)
      MEMORY: 9 GB
       PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"

INFO: task kjournald:654 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald     D 0000D823  2860   654     15           680   618 (L-TLB)
       f7db9ed4 00000046 855989f2 0000d823 f1bc4e94 00000000 053de60d 0000000a 
       f7888550 85599e7a 0000d823 00001488 00000001 f788865c cd819744 f16d3e40 
       1adf714e f788ae84 00000000 f788ae9c 1adf714e c042d4db f788ae7c 1adf714e 
Call Trace:
 [<c042d4db>] getnstimeofday+0x30/0xb6
 [<c061ede6>] io_schedule+0x36/0x59
 [<c04786f5>] sync_buffer+0x30/0x33
 [<c061efbd>] __wait_on_bit+0x33/0x58
 [<c04786c5>] sync_buffer+0x0/0x33
 [<c04786c5>] sync_buffer+0x0/0x33
 [<c061f044>] out_of_line_wait_on_bit+0x62/0x6a
 [<c0436fec>] wake_bit_function+0x0/0x3c
 [<c0478672>] __wait_on_buffer+0x1c/0x1f
 [<f8866422>] journal_commit_transaction+0x4be/0xefc [jbd]
 [<c042e32b>] lock_timer_base+0x15/0x2f
 [<c042e3aa>] try_to_del_timer_sync+0x65/0x6c
 [<f8869c21>] kjournald+0xa1/0x1c2 [jbd]
 [<c0436fbf>] autoremove_wake_function+0x0/0x2d
 [<f8869b80>] kjournald+0x0/0x1c2 [jbd]
 [<c0436efb>] kthread+0xc0/0xed
 [<c0436e3b>] kthread+0x0/0xed
 [<c0405c87>] kernel_thread_helper+0x7/0x10
 =======================
Kernel panic - not syncing: hung_task: blocked tasks

crash> set 654
    PID: 654
COMMAND: "kjournald"
   TASK: f7888550  [THREAD_INFO: f7db9000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE 

crash> bt 
PID: 654    TASK: f7888550  CPU: 1   COMMAND: "kjournald"
 #0 [f7db9e60] schedule at c061e742
 #1 [f7db9ed8] io_schedule at c061ede1
 #2 [f7db9ee4] sync_buffer at c04786f0
 #3 [f7db9ee8] __wait_on_bit at c061efba
 #4 [f7db9f00] out_of_line_wait_on_bit at c061f03f
 #5 [f7db9f38] __wait_on_buffer at c047866d
 #6 [f7db9f44] journal_commit_transaction at f886641d [jbd]
 #7 [f7db9fa8] kjournald at f8869c1c [jbd]
 #8 [f7db9fcc] kthread at c0436ef9
 #9 [f7db9fe4] kernel_thread_helper at c0405c85

crash> task -R last_ran
PID: 654    TASK: f7888550  CPU: 1   COMMAND: "kjournald"
  last_ran = 237647072697978, 

crash> runq | grep "CPU 1"
CPU 1 RUNQUEUE: cd819744
crash> rq.timestamp_last_tick cd819744
  timestamp_last_tick = 237773097130245
crash> pd (237773097130245-237647072697978)
$2 = 126024432267
crash> pd (237773097130245-237647072697978)/1000000000
$3 = 126

Both timestamps are in nanoseconds, so kjournald had not run on a CPU for 126 seconds when the system panicked. Where was it stalled for those 126 seconds?

    /*
     * Wait for all previously submitted IO to complete.
     */
    spin_lock(&journal->j_list_lock);
    while (commit_transaction->t_locked_list) {
        struct buffer_head *bh;

        jh = commit_transaction->t_locked_list->b_tprev;
        bh = jh2bh(jh);
        get_bh(bh);
        if (buffer_locked(bh)) {
            spin_unlock(&journal->j_list_lock);
            wait_on_buffer(bh);                    <----

kjournald is waiting for I/O to complete: it has to flush a file's dirty data to disk before it can commit the transaction that involves that inode.

Check what the storage is doing:

crash> whatis journal_commit_transaction
void journal_commit_transaction(journal_t *);

crash> dis -rl f8869c1c
...
/usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/jbd/journal.c: 151
0xf8869c1a <kjournald+154>:     mov    %esi,%eax
0xf8869c1c <kjournald+156>:     call   0xf8865f64 <journal_commit_transaction>

crash> dis -rl f886641d
/usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/jbd/commit.c: 303
0xf8865f64 <journal_commit_transaction>:        push   %ebp
0xf8865f65 <journal_commit_transaction+1>:      mov    %eax,%ebp
0xf8865f67 <journal_commit_transaction+3>:      push   %edi
0xf8865f68 <journal_commit_transaction+4>:      push   %esi
...

crash> bt -f
PID: 654    TASK: f7888550  CPU: 1   COMMAND: "kjournald"
...
 #6 [f7db9f44] journal_commit_transaction at f886641d [jbd]
    [RA: f8869c21  SP: f7db9f48  FP: f7db9fa8  SIZE: 100]
    f7db9f48: eedb21c0  00000000  cdb66000  00000000  
    f7db9f58: 00000000  dfda8034  00000fcc  00000000  
    f7db9f68: cdb66000  00000005  f7ca2650  00000000  
    f7db9f78: 00000286  c042e32b  ffffffff  f7ca26ec  
    f7db9f88: 00000286  c042e3aa  00000286  f7ca26ec  
    f7db9f98: f7ca2614  f7ca2600  f7db9fc0  00000000  
    f7db9fa8: f8869c21  
 #7 [f7db9fa8] kjournald at f8869c1c [jbd]
...

crash> journal_t.j_dev,j_fs_dev f7ca2600
  j_dev = 0xcd90da40
  j_fs_dev = 0xcd90da40

crash> block_device.bd_disk 0xcd90da40
  bd_disk = 0xcdbd1780

crash> gendisk.disk_name,queue,in_flight 0xcdbd1780
  disk_name = "sda\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  queue = 0xcdbd9be0
  in_flight = 71

crash> request_queue.rq,queuedata,queue_flags,nr_sorted,in_flight 0xcdbd9be0
  rq = {
    count = {2, 69}, 
    starved = {0, 0}, 
    elvpriv = 71, 
...
  queuedata = 0xcdbd8c00
  queue_flags = 32769
  nr_sorted = 37
  in_flight = 32

crash> dev -d
MAJOR  GENDISK     NAME        REQUEST_QUEUE  TOTAL   READ  WRITE    DRV
    8  cdbd1780    sda         cdbd9be0          71      2     69     32
    2  cda93ac0    fd0         cdbd90ac           0      0      0      0
    3  cdaaa3c0    hda         cda24ba0           0      0      0      0
    9  cdaaa7c0    md0         cdbd9824           0      0      0      0
  253  f1bc67c0    dm-0        f16ecb60           0      0      0      0

This looks like slow or stalled storage; we need to check the timestamps on the oldest in-flight I/O.

Check the timestamps at which the currently in-flight I/Os were issued:

crash> scsi_device.cmd_list,vendor 0xcdbd8c00
  cmd_list = {
    next = 0xf2cfb6c4, 
    prev = 0xcdbd36c4
  }
  vendor = 0xcdbcdbc8 "VMware  Virtual disk    1.0 "

crash> list -H cdbd8c20 -o scsi_cmnd.list -s scsi_cmnd.jiffies_at_alloc,request
f2cfb6c0                         // oldest command
  jiffies_at_alloc = 238022696
  request = 0xe52131d0
...
cdbd36c0                         // newest command
  jiffies_at_alloc = 238085483
  request = 0xf1bc4328

crash> jiffies
jiffies = $16 = 238085642

crash> pd (238085642-238022696)/1000
$17 = 62

crash> pd (238085642-238085483)/1000
$18 = 0

The oldest command in the queue was issued 62 seconds ago; the newest was just issued. 62 seconds is a long time for a queue depth of 32 - it implies the device is only satisfying about 2 I/Os per second.

crash> request.start_time 0xe52131d0
  start_time = 237848662

crash> request.start_time 0xf1bc4328
  start_time = 237848876

crash> pd (238085642-237848662)/1000
$19 = 236

crash> pd (238085642-237848876)/1000
$20 = 236

All of the requests for the in-flight I/O were created roughly 236 seconds ago, yet the oldest SCSI command was only allocated 62 seconds ago. This means the requests were stalled in the block layer queue for 3-4 minutes before finally being handed off to the device driver.
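
As a cross-check, the same kind of calculation gives the time the oldest request spent queued before its SCSI command was allocated (the output line is a sketch; the jiffies values are taken from the listings above):

crash> pd (238022696-237848662)/1000
$21 = 174                              // ~3 minutes queued before reaching the driver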

crash> request.bio 0xe52131d0
  bio = 0xed15bec0

crash> bio.bi_end_io 0xed15bec0
  bi_end_io = 0xc046b153 <end_swap_bio_write>

crash> request.bio 0xf1bc4328
  bio = 0xe1e51c40

crash> bio.bi_end_io 0xe1e51c40
  bi_end_io = 0xc046b153 <end_swap_bio_write>

The oldest and newest commands are both writes to the swap device, so this system is low on free memory and is swapping to disk.
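
If further confirmation of the memory pressure is needed, crash's memory and swap summaries can be consulted; a minimal sketch (output omitted):

crash> kmem -i       // overall memory usage, including swap totals
crash> swap          // usage of each configured swap device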

This is a virtual machine sending commands to a virtual storage device (probably a file-backed disk) and a high load on the host system may be starving this guest.
