clvmd blocks on one node while rejoining the cluster and the kernel shows backtraces for it waiting in dlm_new_lockspace in RHEL 6
Issue
- A node was fenced, and when starting back up, clvmd becomes blocked and never completes activation of volumes
- I can't mount GFS2 file systems and lvm devices can't be found on the system, and /var/log/messages shows clvmd backtraces
- clvmd is stuck waiting in dlm_new_lockspace when starting, and lvm commands block throughout the cluster
Jul 8 05:25:18 node1 kernel: dlm: Using TCP for communications
Jul 8 05:25:18 node1 kernel: dlm: connecting to 4
Jul 8 05:25:18 node1 kernel: dlm: connecting to 3
Jul 8 05:25:18 node1 kernel: dlm: connecting to 2
Jul 8 05:27:50 node1 kernel: INFO: task clvmd:28851 blocked for more than 120 seconds.
Jul 8 05:27:50 node1 kernel: Tainted: P --------------- 2.6.32-431.17.1.el6.x86_64 #1
Jul 8 05:27:50 node1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 8 05:27:50 node1 kernel: clvmd D 0000000000000011 0 28851 1 0x00000080
Jul 8 05:27:50 node1 kernel: ffff880829fe7c98 0000000000000086 0000000000000000 ffffffff810699d3
Jul 8 05:27:50 node1 kernel: ffff880829fe7c38 ffff88082ea6f538 ffff88082f340ae8 ffff88185c4168a8
Jul 8 05:27:50 node1 kernel: ffff88082ea6fab8 ffff880829fe7fd8 000000000000fbc8 ffff88082ea6fab8
Jul 8 05:27:50 node1 kernel: Call Trace:
Jul 8 05:27:50 node1 kernel: [<ffffffff810699d3>] ? dequeue_entity+0x113/0x2e0
Jul 8 05:27:50 node1 kernel: [<ffffffff81528a95>] schedule_timeout+0x215/0x2e0
Jul 8 05:27:50 node1 kernel: [<ffffffff81527bfe>] ? thread_return+0x4e/0x760
Jul 8 05:27:50 node1 kernel: [<ffffffff81285172>] ? kobject_uevent_env+0x202/0x620
Jul 8 05:27:50 node1 kernel: [<ffffffff81528713>] wait_for_common+0x123/0x180
Jul 8 05:27:50 node1 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Jul 8 05:27:50 node1 kernel: [<ffffffff8152882d>] wait_for_completion+0x1d/0x20
Jul 8 05:27:50 node1 kernel: [<ffffffffa054cf79>] dlm_new_lockspace+0x999/0xa30 [dlm]
Jul 8 05:27:50 node1 kernel: [<ffffffffa0554ff1>] device_write+0x311/0x720 [dlm]
Jul 8 05:27:50 node1 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Jul 8 05:27:50 node1 kernel: [<ffffffff81226056>] ? security_file_permission+0x16/0x20
Jul 8 05:27:50 node1 kernel: [<ffffffff81188c38>] vfs_write+0xb8/0x1a0
Jul 8 05:27:50 node1 kernel: [<ffffffff81189531>] sys_write+0x51/0x90
Jul 8 05:27:50 node1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
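The hung state shown in the backtrace can also be confirmed directly, without waiting for the kernel's hung-task messages, by checking the process state and wait channel of clvmd. A minimal sketch (the bracketed grep pattern is just a trick to keep grep from matching its own process):

```shell
# A clvmd blocked in the kernel will show state "D" (uninterruptible sleep)
# and a WCHAN such as dlm_new_lockspace or wait_for_completion.
ps -eo pid,stat,wchan:32,comm | grep '[c]lvmd'
```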
Environment
- Red Hat Enterprise Linux (RHEL) 6 with the Resilient Storage Add On
- lvm2-cluster is installed and its clvmd daemon is running, with 'locking_type = 3' set in /etc/lvm/lvm.conf
- There are no ongoing problems in the cluster that would cause cluster services to become blocked, such as a loss of quorum or failed fencing
- One or more nodes shows a consistently non-zero Recv-Q value on the DLM connection between itself and the node attempting to rejoin, as described in the Diagnostic Steps below
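The Recv-Q condition above can be checked from each surviving node by watching the DLM TCP connections in netstat. A minimal sketch, assuming DLM is using its default TCP port, 21064:

```shell
# Show the netstat headers plus any DLM TCP connections (default port 21064).
# The Recv-Q column holds received bytes not yet consumed by the local node;
# a value that stays above 0 across repeated runs indicates the stuck
# connection described in the Environment section.
netstat -tn | awk 'NR <= 2 || /:21064/'
```

Running this a few seconds apart on each node shows whether the Recv-Q value is draining or stuck.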