clvmd start timed out with dlm socket error

Solution Verified

Issue

  • On RHEL 6, a node briefly rejoins the cluster after fencing or qdisk eviction, but is then immediately killed by another node.
Jun 14 02:39:33 node1 qdiskd[30861]: qdisk cycle took more than 5 seconds to complete (6.070000)
Jun 14 02:43:06 node1 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Jun 14 02:43:12 node1 corosync[4659]:   [MAIN  ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
Jun 14 02:43:13 node1 corosync[4659]:   [QUORUM] Members[3]: 1 2 3
Jun 14 02:43:13 node1 corosync[4659]:   [CPG   ] chosen downlist: sender r(0) ip(10.238.11.147) ; members(old:1 left:0)
Jun 14 02:43:13 node1 corosync[4659]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 14 02:43:13 node1 corosync[4659]: cman killed by node 2 because we were killed by cman_tool or other application

Jun 14 02:42:46 node2 qdiskd[8000]: Writing eviction notice for node 1
Jun 14 02:42:51 node2 qdiskd[8000]: Node 1 evicted

Jun 14 02:43:10 node3 kernel: dlm: node 3: socket error sending to node 1 at 10.238.11.147, port 21064, sk_err=104/113
Jun 14 02:43:13 node3 corosync[11504]:   [QUORUM] Members[2]: 2 3
Jun 14 02:43:13 node3 corosync[11504]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 14 02:43:13 node3 rgmanager[16234]: State change: node1.mgmt.ups.com DOWN
Jun 14 02:43:13 node3 corosync[11504]:   [QUORUM] Members[3]: 1 2 3
Jun 14 02:43:13 node3 corosync[11504]:   [QUORUM] Members[3]: 1 2 3
Jun 14 02:43:13 node3 corosync[11504]:   [CPG   ] chosen downlist: sender r(0) ip(10.238.11.148) ; members(old:3 left:1)
Jun 14 02:43:13 node3 corosync[11504]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 14 02:43:13 node3 kernel: dlm: closing connection to node 1
  • On RHEL 7, after a node is fenced, clvmd startup times out after a dlm socket error.
Oct 26 08:07:16 rhel7n2 kernel: dlm: Using TCP for communications
Oct 26 08:07:16 rhel7n2 kernel: dlm: connecting to 1
Oct 26 08:07:16 rhel7n2 kernel: dlm: node 2: socket error sending to node 1, port 21064, sk_err=104/0
...
Oct 26 08:12:20 rhel7n2 lrmd[1209]: warning: clvmd_start_0 process (PID 1593) timed out
Oct 26 08:12:20 rhel7n2 lrmd[1209]: warning: clvmd_start_0:1593 - timed out after 300000ms
Oct 26 08:12:20 rhel7n2 crmd[1212]:   error: Result of start operation for clvmd on rhel7n2.example.com: Timed Out
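
The dlm socket error lines in the logs above carry numeric error codes (sk_err=104/113 on RHEL 6, sk_err=104/0 on RHEL 7). As a rough aid for reading them, the following sketch extracts the fields from such a line and maps the codes to standard Linux errno names. It is not part of any Red Hat tooling: the function name decode_dlm_socket_error and the LINUX_ERRNO table are hypothetical, and treating the second number as the socket's "soft" error is an assumption about how the kernel formats this message.

```python
import re

# Assumption: generic Linux errno numbering (as on x86_64).
# Only the codes relevant to the log excerpts are listed.
LINUX_ERRNO = {
    0: "no error",
    104: "ECONNRESET (connection reset by peer)",
    111: "ECONNREFUSED (connection refused)",
    113: "EHOSTUNREACH (no route to host)",
}

# Matches both forms seen in the logs: with and without "at <ip>".
DLM_SOCKET_ERROR = re.compile(
    r"dlm: node (?P<local>\d+): socket error sending to node (?P<peer>\d+)"
    r"(?: at (?P<addr>[\d.]+))?, port (?P<port>\d+), "
    r"sk_err=(?P<err>\d+)/(?P<soft>\d+)"
)


def decode_dlm_socket_error(line):
    """Return a dict describing a dlm socket error line, or None if no match."""
    m = DLM_SOCKET_ERROR.search(line)
    if m is None:
        return None
    err = int(m.group("err"))
    soft = int(m.group("soft"))  # assumed to be the socket's soft error
    return {
        "local_node": int(m.group("local")),
        "peer_node": int(m.group("peer")),
        "peer_addr": m.group("addr"),  # None when the line has no address
        "port": int(m.group("port")),
        "error": LINUX_ERRNO.get(err, "errno %d" % err),
        "soft_error": LINUX_ERRNO.get(soft, "errno %d" % soft),
    }


if __name__ == "__main__":
    sample = (
        "Jun 14 02:43:10 node3 kernel: dlm: node 3: socket error sending "
        "to node 1 at 10.238.11.147, port 21064, sk_err=104/113"
    )
    print(decode_dlm_socket_error(sample))
```

Under these assumptions, both excerpts show node 1's peer connection being reset (104, ECONNRESET), and the RHEL 6 line additionally reports the peer as unreachable (113, EHOSTUNREACH).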

Environment

  • Red Hat Enterprise Linux 6 or 7 (with the High Availability Add-on)
  • dlm and clvmd
