Why do lvm commands hang, GFS/GFS2 file systems become unresponsive, and cluster operations stall when fencing fails in a RHEL High Availability cluster?


Issue

  • A GFS2 filesystem is hung. How can I recover from this situation, and what is the solution?
  • With clvmd in use, pvscan and other lvm commands hang completely (see the LVM locking check sketched after this list):

    # pvscan -vv
    Setting global/locking_type to 3
    Setting global/wait_for_locks to 1
    Cluster locking selected.
    
  • Why does access to GFS or GFS2 file systems hang when fencing is failing?

  • Why is my cluster inoperable after a node has crashed or become unresponsive?
  • After the cluster failed to fence a node, we could not start clvmd on any node (see the fence-domain check sketched after this list):

    Oct 25 08:25:51 node1 fenced[4671]: fence node3.example.com failed
    Oct 25 08:25:54 node1 fenced[4671]: fencing node node3.example.com
    Oct 25 08:26:11 node1 fenced[4671]: fence node3.example.com dev 0.0 agent fence_ipmilan result: error from agent
    
    # service clvmd start
    Starting clvmd: clvmd startup timed out
    
  • Services such as rgmanager, clvmd, gfs2_quotad, and other processes that access the cluster infrastructure become blocked after fencing fails in the cluster (see the blocked-task check sketched after this list):

    Apr  7 22:55:03 node1 kernel: INFO: task rgmanager:6739 blocked for more than 120 seconds.
    Apr  7 22:55:03 node1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Apr  7 22:55:03 node1 kernel: rgmanager     D ffff88071fc28400     0  6739   6738 0x10000080
    Apr  7 22:55:03 node1 kernel: ffff880d85b59d48 0000000000000082 0000000000000000 ffff880d8cf19918
    Apr  7 22:55:03 node1 kernel: ffff880d85b02d80 0000000000000001 ffff880d85b01ff8 00007fffbffb00e0
    Apr  7 22:55:03 node1 kernel: ffff880d85b21a78 ffff880d85b59fd8 000000000000f4e8 ffff880d85b21a78
    Apr  7 22:55:03 node1 kernel: Call Trace:
    Apr  7 22:55:03 node1 kernel: [<ffffffff814ee0ae>] __mutex_lock_slowpath+0x13e/0x180
    Apr  7 22:55:03 node1 kernel: [<ffffffff814edf4b>] mutex_lock+0x2b/0x50
    Apr  7 22:55:03 node1 kernel: [<ffffffffa076c92c>] dlm_new_lockspace+0x3c/0xa30 [dlm]
    Apr  7 22:55:03 node1 kernel: [<ffffffff8115f40c>] ? __kmalloc+0x20c/0x220
    Apr  7 22:55:03 node1 kernel: [<ffffffffa077594d>] device_write+0x30d/0x7d0 [dlm]
    Apr  7 22:55:03 node1 kernel: [<ffffffff810eab02>] ? ring_buffer_lock_reserve+0xa2/0x160
    Apr  7 22:55:03 node1 kernel: [<ffffffff810d46e2>] ? audit_syscall_entry+0x272/0x2a0
    Apr  7 22:55:03 node1 kernel: [<ffffffff8120c3c6>] ? security_file_permission+0x16/0x20
    Apr  7 22:55:03 node1 kernel: [<ffffffff811765d8>] vfs_write+0xb8/0x1a0
    Apr  7 22:55:03 node1 kernel: [<ffffffff81176fe1>] sys_write+0x51/0x90
    Apr  7 22:55:03 node1 kernel: [<ffffffff8100b308>] tracesys+0xd9/0xde
    [...]   
    Apr  7 23:19:30 node1 kernel: INFO: task clvmd:5603 blocked for more than 120 seconds.
    Apr  7 23:19:30 node1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Apr  7 23:19:30 node1 kernel: clvmd         D ffff88071fc28400     0  5603      1 0x10000080
    Apr  7 23:19:30 node1 kernel: ffff88070a983d48 0000000000000086 0000000000000000 ffff8806f6149700
    Apr  7 23:19:30 node1 kernel: ffff880d8d5842a8 0000000000000000 ffff88070a742cd8 00007f49b3655d50
    Apr  7 23:19:30 node1 kernel: ffff8806f629fa78 ffff88070a983fd8 000000000000f4e8 ffff8806f629fa78
    Apr  7 23:19:30 node1 kernel: Call Trace:
    Apr  7 23:19:30 node1 kernel: [<ffffffff814ee0ae>] __mutex_lock_slowpath+0x13e/0x180
    Apr  7 23:19:30 node1 kernel: [<ffffffff814edf4b>] mutex_lock+0x2b/0x50
    Apr  7 23:19:30 node1 kernel: [<ffffffffa077192c>] dlm_new_lockspace+0x3c/0xa30 [dlm]
    Apr  7 23:19:30 node1 kernel: [<ffffffff8115f40c>] ? __kmalloc+0x20c/0x220
    Apr  7 23:19:30 node1 kernel: [<ffffffffa077a94d>] device_write+0x30d/0x7d0 [dlm]
    Apr  7 23:19:30 node1 kernel: [<ffffffff810eab02>] ? ring_buffer_lock_reserve+0xa2/0x160
    Apr  7 23:19:30 node1 kernel: [<ffffffff810d46e2>] ? audit_syscall_entry+0x272/0x2a0
    Apr  7 23:19:30 node1 kernel: [<ffffffff8120c3c6>] ? security_file_permission+0x16/0x20
    Apr  7 23:19:30 node1 kernel: [<ffffffff811765d8>] vfs_write+0xb8/0x1a0
    Apr  7 23:19:30 node1 kernel: [<ffffffff81176fe1>] sys_write+0x51/0x90
    Apr  7 23:19:30 node1 kernel: [<ffffffff8100b308>] tracesys+0xd9/0xde
    
  • We have a 2-node Red Hat cluster with GFS filesystems that must be rebooted about once a week due to GFS hangs. If a member goes down, does the whole cluster go down and have to be rebooted?

  • The GFS2 filesystem frequently goes into a hung state, and the cluster nodes need to be rebooted to recover from the situation.
  • GFS2 filesystem getting hung
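
The LVM locking check: a minimal sketch, assuming a cman/clvmd-based cluster (RHEL 5 or 6) and root access. These are standard commands from the lvm2-cluster, cman, and dlm packages, but output and availability vary by release:

    # Confirm LVM is configured for cluster-wide locking (locking_type 3):
    grep -E '^[[:space:]]*(locking_type|wait_for_locks)' /etc/lvm/lvm.conf

    # Confirm clvmd is actually running on this node:
    ps -ef | grep '[c]lvmd'

    # List DLM lockspaces; a "clvmd" lockspace should be present when
    # cluster locking is active (dlm_tool on RHEL 6, group_tool on RHEL 5):
    dlm_tool ls
    group_tool ls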
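
The fence-domain check: when fencing of a failed node does not complete, DLM and GFS/GFS2 recovery is deferred until it does, which is why clvmd startup times out as shown above. A sketch of how to see whether fencing is still pending, assuming a cman-based cluster (RHEL 5/6) for the first two commands and a pacemaker-based cluster (RHEL 7 and later) for the last two; node3.example.com is taken from the example logs:

    # cman-based clusters: fence domain state, including nodes still waiting to be fenced:
    fence_tool ls
    cman_tool nodes

    # pacemaker-based clusters: overall status and fencing history for the failed node:
    pcs status
    stonith_admin --history node3.example.com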
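
The blocked-task check: the call traces above show rgmanager and clvmd blocked inside dlm_new_lockspace, i.e. waiting on the DLM rather than on storage. A rough way to confirm which processes are stuck and where, assuming root access; /proc/<pid>/stack may not exist on older (RHEL 5) kernels:

    # Dump all uninterruptible (D-state) tasks to the kernel log, then read it back:
    echo w > /proc/sysrq-trigger
    dmesg | grep -A 20 'blocked for more than'

    # Kernel stack of a specific blocked process, for example clvmd:
    cat /proc/$(pidof clvmd)/stack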

Environment

  • Red Hat Enterprise Linux (RHEL) 5, 6, 7, 8, 9 with High Availability Add-On
