clvmd start timed out with dlm socket error
Issue
- On RHEL 6, a node is able to rejoin the cluster briefly after fencing or qdisk eviction, but then it is immediately kicked out by another node.
Jun 14 02:39:33 node1 qdiskd[30861]: qdisk cycle took more than 5 seconds to complete (6.070000)
Jun 14 02:43:06 node1 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Jun 14 02:43:12 node1 corosync[4659]: [MAIN ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
Jun 14 02:43:13 node1 corosync[4659]: [QUORUM] Members[3]: 1 2 3
Jun 14 02:43:13 node1 corosync[4659]: [CPG ] chosen downlist: sender r(0) ip(10.238.11.147) ; members(old:1 left:0)
Jun 14 02:43:13 node1 corosync[4659]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 14 02:43:13 node1 corosync[4659]: cman killed by node 2 because we were killed by cman_tool or other application
Jun 14 02:42:46 node2 qdiskd[8000]: Writing eviction notice for node 1
Jun 14 02:42:51 node2 qdiskd[8000]: Node 1 evicted
Jun 14 02:43:10 node3 kernel: dlm: node 3: socket error sending to node 1 at 10.0.0.1, port 21064, sk_err=104/113
Jun 14 02:43:13 node3 corosync[11504]: [QUORUM] Members[2]: 2 3
Jun 14 02:43:13 node3 corosync[11504]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 14 02:43:13 node3 rgmanager[16234]: State change: node1.example.com DOWN
Jun 14 02:43:13 node3 corosync[11504]: [QUORUM] Members[3]: 1 2 3
Jun 14 02:43:13 node3 corosync[11504]: [QUORUM] Members[3]: 1 2 3
Jun 14 02:43:13 node3 corosync[11504]: [CPG ] chosen downlist: sender r(0) ip(10.0.0.2) ; members(old:3 left:1)
Jun 14 02:43:13 node3 corosync[11504]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 14 02:43:13 node3 kernel: dlm: closing connection to node 1
- On RHEL 7, after a node is fenced, clvmd startup times out after a dlm socket error.
Oct 26 08:07:16 node2 kernel: dlm: Using TCP for communications
Oct 26 08:07:16 node2 kernel: dlm: connecting to 1
Oct 26 08:07:16 node2 kernel: dlm: node 2: socket error sending to node 1, port 21064, sk_err=104/0
...
Oct 26 08:12:20 node2 lrmd[1209]: warning: clvmd_start_0 process (PID 1593) timed out
Oct 26 08:12:20 node2 lrmd[1209]: warning: clvmd_start_0:1593 - timed out after 300000ms
Oct 26 08:12:20 node2 crmd[1212]: error: Result of start operation for clvmd on node2.example.com: Timed Out
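The `sk_err=` values in the dlm messages above are numeric Linux errno codes. The following is a minimal sketch for decoding them when triaging logs, not a supported tool. It assumes the two numbers are the socket's hard and soft error fields, which is an interpretation of the kernel message format and is not stated in the logs themselves. The function name `decode_dlm_socket_error` and the small errno table are hypothetical and cover only a few common network errors.

```python
import re

# Linux errno numbers for a few network errors. They are hardcoded so the
# result does not depend on the operating system running this script.
LINUX_ERRNO = {
    0: "none",
    104: "ECONNRESET (Connection reset by peer)",
    110: "ETIMEDOUT (Connection timed out)",
    111: "ECONNREFUSED (Connection refused)",
    113: "EHOSTUNREACH (No route to host)",
}

# Matches both the RHEL 6 form ("... to node 1 at 10.0.0.1, port ...")
# and the RHEL 7 form ("... to node 1, port ...").
SK_ERR_RE = re.compile(
    r"dlm: node (?P<local>\d+): socket error sending to node (?P<peer>\d+)"
    r".*?sk_err=(?P<err>\d+)/(?P<soft>\d+)"
)


def decode_dlm_socket_error(line):
    """Return a dict describing a dlm socket error line, or None if no match.

    Assumption: the first sk_err number is the hard socket error and the
    second is the soft socket error.
    """
    match = SK_ERR_RE.search(line)
    if match is None:
        return None
    err = int(match.group("err"))
    soft = int(match.group("soft"))
    return {
        "local_node": int(match.group("local")),
        "peer_node": int(match.group("peer")),
        "sk_err": LINUX_ERRNO.get(err, "unknown errno %d" % err),
        "sk_err_soft": LINUX_ERRNO.get(soft, "unknown errno %d" % soft),
    }


if __name__ == "__main__":
    sample = (
        "Oct 26 08:07:16 node2 kernel: dlm: node 2: socket error "
        "sending to node 1, port 21064, sk_err=104/0"
    )
    print(decode_dlm_socket_error(sample))
```

With this table, 104 decodes to ECONNRESET and 113 to EHOSTUNREACH, meaning the peer reset the connection or could not be reached while dlm was trying to send to it.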
Environment
- Red Hat Enterprise Linux 6 or 7 (with the High Availability Add-on)
- dlm and clvmd