clvmd blocks on one node while rejoining the cluster and the kernel shows backtraces for it waiting in dlm_new_lockspace in RHEL 6

Issue

  • A node was fenced, and when it starts back up, clvmd becomes blocked and never completes activation of volumes
  • I can't mount GFS2 file systems, LVM devices can't be found on the system, and /var/log/messages shows clvmd backtraces
  • clvmd is stuck waiting in dlm_new_lockspace when starting, and LVM commands block throughout the cluster, for example:
Jul  8 05:25:18 node1 kernel: dlm: Using TCP for communications
Jul  8 05:25:18 node1 kernel: dlm: connecting to 4
Jul  8 05:25:18 node1 kernel: dlm: connecting to 3
Jul  8 05:25:18 node1 kernel: dlm: connecting to 2
Jul  8 05:27:50 node1 kernel: INFO: task clvmd:28851 blocked for more than 120 seconds.
Jul  8 05:27:50 node1 kernel:      Tainted: P           ---------------    2.6.32-431.17.1.el6.x86_64 #1
Jul  8 05:27:50 node1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  8 05:27:50 node1 kernel: clvmd         D 0000000000000011     0 28851      1 0x00000080
Jul  8 05:27:50 node1 kernel: ffff880829fe7c98 0000000000000086 0000000000000000 ffffffff810699d3
Jul  8 05:27:50 node1 kernel: ffff880829fe7c38 ffff88082ea6f538 ffff88082f340ae8 ffff88185c4168a8
Jul  8 05:27:50 node1 kernel: ffff88082ea6fab8 ffff880829fe7fd8 000000000000fbc8 ffff88082ea6fab8
Jul  8 05:27:50 node1 kernel: Call Trace:
Jul  8 05:27:50 node1 kernel: [<ffffffff810699d3>] ? dequeue_entity+0x113/0x2e0
Jul  8 05:27:50 node1 kernel: [<ffffffff81528a95>] schedule_timeout+0x215/0x2e0
Jul  8 05:27:50 node1 kernel: [<ffffffff81527bfe>] ? thread_return+0x4e/0x760
Jul  8 05:27:50 node1 kernel: [<ffffffff81285172>] ? kobject_uevent_env+0x202/0x620
Jul  8 05:27:50 node1 kernel: [<ffffffff81528713>] wait_for_common+0x123/0x180
Jul  8 05:27:50 node1 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Jul  8 05:27:50 node1 kernel: [<ffffffff8152882d>] wait_for_completion+0x1d/0x20
Jul  8 05:27:50 node1 kernel: [<ffffffffa054cf79>] dlm_new_lockspace+0x999/0xa30 [dlm]
Jul  8 05:27:50 node1 kernel: [<ffffffffa0554ff1>] device_write+0x311/0x720 [dlm]
Jul  8 05:27:50 node1 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Jul  8 05:27:50 node1 kernel: [<ffffffff81226056>] ? security_file_permission+0x16/0x20
Jul  8 05:27:50 node1 kernel: [<ffffffff81188c38>] vfs_write+0xb8/0x1a0
Jul  8 05:27:50 node1 kernel: [<ffffffff81189531>] sys_write+0x51/0x90
Jul  8 05:27:50 node1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
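
If the hung-task messages have not yet appeared in the log, the same information can be gathered on demand. A minimal sketch, assuming clvmd is the only matching process and that sysrq is enabled on the node:

# Dump the kernel-mode stack of the blocked clvmd process directly
cat /proc/$(pidof clvmd)/stack

# Or have the kernel log a backtrace for every task stuck in
# uninterruptible (D) state, as in the messages above
echo w > /proc/sysrq-trigger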

Environment

  • Red Hat Enterprise Linux (RHEL) 6 with the Resilient Storage Add On
  • lvm2-cluster
    • clvmd is running
    • locking_type = 3 in /etc/lvm/lvm.conf (see the configuration check after this list)
  • There are no ongoing problems in the cluster that would cause cluster services to become blocked, such as a loss of quorum or failed fencing
  • One or more nodes shows a consistently nonzero Recv-Q value for the DLM connection between itself and the node that is attempting to rejoin, as in the check sketched below
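
The locking configuration can be confirmed with lvm's own config dump; expect locking_type=3 when clvmd is in use:

# Show the effective locking setting from /etc/lvm/lvm.conf
lvm dumpconfig global/locking_type

A minimal sketch of the Recv-Q check, assuming DLM is using its default TCP port (21064):

# A persistently nonzero Recv-Q on the DLM connection to the rejoining
# node means that peer has stopped draining its socket
netstat -tn | grep ':21064'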
