Why do lvm commands hang, GFS/GFS2 file systems become unresponsive, and cluster operations hang when fencing is failing in a RHEL High Availability cluster?
Issue
- GFS2 filesystem hang, how can I recover from this situation? What's the solution?
- With clvmd, pvscan and other LVM commands hang completely (see the LVM locking check sketched after this list):

      # pvscan -vv
        Setting global/locking_type to 3
        Setting global/wait_for_locks to 1
        Cluster locking selected.
- Why does access to GFS or GFS2 file systems hang when fencing is failing?
- Why is my cluster inoperable after a node has crashed or become unresponsive?
- After a node failed to be fenced by the cluster, we could not start clvmd on any node (see the fence-status check sketched after this list):

      Oct 25 08:25:51 node1 fenced[4671]: fence node3.example.com failed
      Oct 25 08:25:54 node1 fenced[4671]: fencing node node3.example.com
      Oct 25 08:26:11 node1 fenced[4671]: fence node3.example.com dev 0.0 agent fence_ipmilan result: error from agent

      # service clvmd start
      Starting clvmd: clvmd startup timed out
- Services and processes like rgmanager, clvmd, gfs2_quotad, or other processes accessing or using the cluster infrastructure become blocked after fencing fails in a cluster (see the DLM lockspace check sketched after this list):

      Apr 7 22:55:03 node1 kernel: INFO: task rgmanager:6739 blocked for more than 120 seconds.
      Apr 7 22:55:03 node1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Apr 7 22:55:03 node1 kernel: rgmanager D ffff88071fc28400 0 6739 6738 0x10000080
      Apr 7 22:55:03 node1 kernel: ffff880d85b59d48 0000000000000082 0000000000000000 ffff880d8cf19918
      Apr 7 22:55:03 node1 kernel: ffff880d85b02d80 0000000000000001 ffff880d85b01ff8 00007fffbffb00e0
      Apr 7 22:55:03 node1 kernel: ffff880d85b21a78 ffff880d85b59fd8 000000000000f4e8 ffff880d85b21a78
      Apr 7 22:55:03 node1 kernel: Call Trace:
      Apr 7 22:55:03 node1 kernel: [<ffffffff814ee0ae>] __mutex_lock_slowpath+0x13e/0x180
      Apr 7 22:55:03 node1 kernel: [<ffffffff814edf4b>] mutex_lock+0x2b/0x50
      Apr 7 22:55:03 node1 kernel: [<ffffffffa076c92c>] dlm_new_lockspace+0x3c/0xa30 [dlm]
      Apr 7 22:55:03 node1 kernel: [<ffffffff8115f40c>] ? __kmalloc+0x20c/0x220
      Apr 7 22:55:03 node1 kernel: [<ffffffffa077594d>] device_write+0x30d/0x7d0 [dlm]
      Apr 7 22:55:03 node1 kernel: [<ffffffff810eab02>] ? ring_buffer_lock_reserve+0xa2/0x160
      Apr 7 22:55:03 node1 kernel: [<ffffffff810d46e2>] ? audit_syscall_entry+0x272/0x2a0
      Apr 7 22:55:03 node1 kernel: [<ffffffff8120c3c6>] ? security_file_permission+0x16/0x20
      Apr 7 22:55:03 node1 kernel: [<ffffffff811765d8>] vfs_write+0xb8/0x1a0
      Apr 7 22:55:03 node1 kernel: [<ffffffff81176fe1>] sys_write+0x51/0x90
      Apr 7 22:55:03 node1 kernel: [<ffffffff8100b308>] tracesys+0xd9/0xde
      [...]
      Apr 7 23:19:30 node1 kernel: INFO: task clvmd:5603 blocked for more than 120 seconds.
      Apr 7 23:19:30 node1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Apr 7 23:19:30 node1 kernel: clvmd D ffff88071fc28400 0 5603 1 0x10000080
      Apr 7 23:19:30 node1 kernel: ffff88070a983d48 0000000000000086 0000000000000000 ffff8806f6149700
      Apr 7 23:19:30 node1 kernel: ffff880d8d5842a8 0000000000000000 ffff88070a742cd8 00007f49b3655d50
      Apr 7 23:19:30 node1 kernel: ffff8806f629fa78 ffff88070a983fd8 000000000000f4e8 ffff8806f629fa78
      Apr 7 23:19:30 node1 kernel: Call Trace:
      Apr 7 23:19:30 node1 kernel: [<ffffffff814ee0ae>] __mutex_lock_slowpath+0x13e/0x180
      Apr 7 23:19:30 node1 kernel: [<ffffffff814edf4b>] mutex_lock+0x2b/0x50
      Apr 7 23:19:30 node1 kernel: [<ffffffffa077192c>] dlm_new_lockspace+0x3c/0xa30 [dlm]
      Apr 7 23:19:30 node1 kernel: [<ffffffff8115f40c>] ? __kmalloc+0x20c/0x220
      Apr 7 23:19:30 node1 kernel: [<ffffffffa077a94d>] device_write+0x30d/0x7d0 [dlm]
      Apr 7 23:19:30 node1 kernel: [<ffffffff810eab02>] ? ring_buffer_lock_reserve+0xa2/0x160
      Apr 7 23:19:30 node1 kernel: [<ffffffff810d46e2>] ? audit_syscall_entry+0x272/0x2a0
      Apr 7 23:19:30 node1 kernel: [<ffffffff8120c3c6>] ? security_file_permission+0x16/0x20
      Apr 7 23:19:30 node1 kernel: [<ffffffff811765d8>] vfs_write+0xb8/0x1a0
      Apr 7 23:19:30 node1 kernel: [<ffffffff81176fe1>] sys_write+0x51/0x90
      Apr 7 23:19:30 node1 kernel: [<ffffffff8100b308>] tracesys+0xd9/0xde
- We have a two-node Red Hat cluster with GFS filesystems. It must be rebooted about once a week because GFS hangs; why does the whole cluster go down and require a reboot when one member goes down?
- Why does the GFS2 filesystem frequently go into a hung state, requiring the cluster nodes to be rebooted to recover?
- GFS2 Filesystem getting hung
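
For the hanging pvscan/LVM-command symptom above, a minimal first check is to confirm that LVM really is configured for cluster locking (locking_type 3 requires a running clvmd) and that clvmd is active on the node. This is only a sketch for a RHEL 6-style clvmd setup; the file path and service name are the usual defaults, not values taken from the logs above:

      # confirm cluster locking is configured in LVM
      grep -E '^[[:space:]]*locking_type' /etc/lvm/lvm.conf

      # confirm clvmd is actually running on this node (init-script style)
      service clvmd status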
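For the failed-fencing symptom above, GFS/GFS2 and clvmd recovery stays blocked until the fence operation completes. A rough way to confirm that a fence action is still pending is shown below; the exact tools depend on the release (fence_tool on cman-based RHEL 6, pcs/stonith_admin on pacemaker-based RHEL 7 and later), so treat this as a sketch:

      # RHEL 6 (cman/fenced): show the fence domain and any node still waiting to be fenced
      fence_tool ls

      # RHEL 7+ (pacemaker): overall status plus recent fence actions
      pcs status
      stonith_admin --history '*'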
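The blocked-task traces above show processes waiting in dlm_new_lockspace, which cannot proceed while DLM recovery is held up by the outstanding fence. A rough check of the DLM lockspaces (clvmd and each GFS2 filesystem use one) on a node running dlm_controld is:

      # list DLM lockspaces and their current state
      dlm_tool ls

      # on cman-based releases, show the fence, dlm and gfs groups
      group_tool ls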
Environment
- Red Hat Enterprise Linux (RHEL) 5, 6, 7, 8, 9 with High Availability Add-On