Oracle RAC node rebooted due to stalled I/O requests on a few paths to SAN devices

Solution Verified

Issue

  • The Oracle RAC nodes have 8 paths to each LUN. When only a few of these paths failed, I/O requests to the voting disks stalled, and Oracle Clusterware evicted the node from the cluster:

    The cssdagent process triggered a SysRq crash, which brought the node down:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [<ffffffff8135ca26>] sysrq_handle_crash+0x16/0x20
    PGD 4c370d5067 PUD 4895815067 PMD 0 
    Oops: 0002 [#1] SMP 
    [...]Pid: 37192, comm: cssdagent Tainted: P        W  -- ------------    2.6.32-642.15.1.el6.x86_64 #1 HP ProLiant BL460c Gen9
    RIP: 0010:[<ffffffff8135ca26>]  [<ffffffff8135ca26>] sysrq_handle_crash+0x16/0x20
    RSP: 0018:ffff884889d87e18  EFLAGS: 00010092
    RAX: 0000000000000010 RBX: 0000000000000063 RCX: 000000000000894e
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000063
    RBP: ffff884889d87e18 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000293 R12: 0000000000000000
    R13: ffffffff81b12f60 R14: 0000000000000286 R15: 0000000000000004
    FS:  00007f6eab0e3700(0000) GS:ffff880291800000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 0000004b8bbba000 CR4: 00000000001407e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process cssdagent (pid: 37192, threadinfo ffff884889d84000, task ffff884bfab51520)
    Stack:
     ffff884889d87e68 ffffffff8135cce2 ffff884b982bba28 0000000000000000
    <d> 0000000000000018 0000000000000001 ffff884baf9d7cc0 00007f6eb1692b14
    <d> 0000000000000001 fffffffffffffffb ffff884889d87e98 ffffffff8135cd9e
    Call Trace:
     [<ffffffff8135cce2>] __handle_sysrq+0x132/0x1a0
     [<ffffffff8135cd9e>] write_sysrq_trigger+0x4e/0x50
     [<ffffffff8120595e>] proc_reg_write+0x7e/0xc0
     [<ffffffff81199e48>] vfs_write+0xb8/0x1a0
     [<ffffffff8119b35f>] ? fget_light_pos+0x3f/0x50
     [<ffffffff8119a981>] sys_write+0x51/0xb0
     [<ffffffff810ee43e>] ? __audit_syscall_exit+0x25e/0x290
     [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
    Code: d0 88 81 e3 eb ff 81 c9 c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 c7 05 8d f3 74 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c9 c3 55 48 89 e5 0f 1f 44 00 00 8d 47 
    RIP  [<ffffffff8135ca26>] sysrq_handle_crash+0x16/0x20
     RSP <ffff884889d87e18>
    CR2: 0000000000000000
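
Although the first line of the oops reports a kernel NULL pointer dereference, the call trace shows that the crash was requested from user space rather than caused by a kernel bug. cssdagent wrote to /proc/sysrq-trigger (sys_write -> vfs_write -> proc_reg_write -> write_sysrq_trigger -> __handle_sysrq), and sysrq_handle_crash deliberately stores to address 0. The faulting bytes `<c6> 04 25 00 00 00 00 01` decode to `movb $0x1,0x0`. The following Python sketch shows one way to flag such console logs automatically. `classify_oops`, the regular expressions, and the frame list are hypothetical helpers written for this illustration. They assume the RHEL 6 oops layout shown above. Frames prefixed with `?` are skipped because the kernel prints them as unreliable.

```python
import re

# Heuristic (hypothetical): both frames must appear for the crash to count as a
# SysRq 'c' request made through /proc/sysrq-trigger, as in the trace above.
SYSRQ_CRASH_FRAMES = ("sysrq_handle_crash", "write_sysrq_trigger")

# Matches "[<ffffffff8135cce2>] __handle_sysrq+0x132/0x1a0"; group 1 captures
# the "? " prefix the kernel prints for unreliable frames.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?(\w+)\+0x")
# Matches "Process cssdagent (pid: 37192, ..."
PROCESS_RE = re.compile(r"Process (\S+) \(pid: (\d+)")


def classify_oops(log: str) -> dict:
    """Report whether the oops was SysRq-triggered and which process hit it."""
    reliable = {m.group(2) for m in FRAME_RE.finditer(log) if m.group(1) is None}
    proc = PROCESS_RE.search(log)
    return {
        "sysrq_triggered": all(f in reliable for f in SYSRQ_CRASH_FRAMES),
        "process": proc.group(1) if proc else None,
        "pid": int(proc.group(2)) if proc else None,
    }


# Excerpt of the console log quoted in this article.
EXCERPT = """\
IP: [<ffffffff8135ca26>] sysrq_handle_crash+0x16/0x20
Process cssdagent (pid: 37192, threadinfo ffff884889d84000, task ffff884bfab51520)
Call Trace:
 [<ffffffff8135cce2>] __handle_sysrq+0x132/0x1a0
 [<ffffffff8135cd9e>] write_sysrq_trigger+0x4e/0x50
 [<ffffffff8120595e>] proc_reg_write+0x7e/0xc0
 [<ffffffff81199e48>] vfs_write+0xb8/0x1a0
 [<ffffffff8119b35f>] ? fget_light_pos+0x3f/0x50
 [<ffffffff8119a981>] sys_write+0x51/0xb0
"""

if __name__ == "__main__":
    print(classify_oops(EXCERPT))
    # -> {'sysrq_triggered': True, 'process': 'cssdagent', 'pid': 37192}
```

A `True` result for sysrq_triggered means the investigation should focus on why cssdagent decided to reset the node (here, the stalled voting disk I/O), not on the crash site itself.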
    

Environment

  • Red Hat Enterprise Linux 6.9
    • kernel-2.6.32-642.15.1.el6.x86_64
  • Red Hat Enterprise Linux 7.4
    • kernel version < 3.10.0-862.el7
  • Broadcom BCM57840 partial hardware offload FCoE interfaces
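
The RHEL 7 entry above defines the affected range by kernel release. The following minimal Python sketch checks a node's `uname -r` string against that boundary. `rhel7_kernel_predates_862` and `release_tuple` are hypothetical helper names. The check covers only the RHEL 7 boundary, because the RHEL 6 entry names a single kernel rather than a range. A `True` result means only that the kernel is older than 3.10.0-862.el7. Whether a node is actually exposed also depends on whether it uses Broadcom BCM57840 FCoE interfaces.

```python
import platform
from typing import Optional, Tuple

# Boundary taken from the Environment list: RHEL 7 kernels older than
# 3.10.0-862.el7 are listed as affected.
FIRST_UNLISTED_RELEASE = (862,)


def release_tuple(release: str) -> Tuple[int, ...]:
    """Turn '693.21.1' into (693, 21, 1), stopping at the first non-numeric field."""
    parts = []
    for field in release.split("."):
        if not field.isdigit():
            break
        parts.append(int(field))
    return tuple(parts)


def rhel7_kernel_predates_862(uname_r: str) -> Optional[bool]:
    """Hypothetical helper: True if uname_r is a RHEL 7 kernel older than
    3.10.0-862.el7, False if it is that release or newer, and None if the
    string does not look like a RHEL 7 kernel release."""
    idx = uname_r.find(".el7")
    if idx == -1:
        return None
    version, sep, release = uname_r[:idx].partition("-")
    if not sep or version != "3.10.0":
        return None
    # Tuple comparison: (693, 21, 1) < (862,) but (862, 3, 2) > (862,).
    return release_tuple(release) < FIRST_UNLISTED_RELEASE


if __name__ == "__main__":
    running = platform.release()  # same string as `uname -r`
    print(running, rhel7_kernel_predates_862(running))
```

Comparing integer tuples rather than raw strings avoids lexical mistakes such as "957" sorting before "96".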
