rgmanager or clustat unresponsive on one node when using altnames in /etc/cluster/cluster.conf in RHEL 6

Solution Verified

Environment

  • Red Hat Enterprise Linux (RHEL) 6 with the High Availability Add-On
  • rgmanager
  • altname entries configured in /etc/cluster/cluster.conf (redundant ring protocol, RRP)

Issue

  • clustat returns "Service states unavailable: Temporary failure; try again" on one node:

  Service states unavailable: Temporary failure; try again
  Cluster Status for myCluster @ Tue Oct 29 15:47:31 2013
  Member Status: Quorate

   Member Name                             ID   Status
   ------ ----                             ---- ------
   node1                                       1 Online
   node2                                       2 Online, Local

  • rgmanager seems to be stuck on one node in the cluster and won't allow me to start any services there

Resolution

  • Workaround: Reboot the entire cluster
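
A full cluster restart on a typical RHEL 6 High Availability stack might look roughly like the following. This is only a sketch: the exact services depend on what is configured in your environment, and on the affected node the stop scripts may themselves hang (because rgmanager is unresponsive), in which case a forced reboot or power cycle of that node may be required.

  # On every node, stop the resource manager first so clustered services
  # are shut down cleanly before the membership layer goes away:
  service rgmanager stop

  # Stop any layered services that are in use on the node (examples only):
  service gfs2 stop       # only if GFS2 file systems are managed by the init script
  service clvmd stop      # only if clustered LVM is in use

  # Stop the cluster membership layer, then reboot the node:
  service cman stop
  reboot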

Please contact Red Hat Global Support Services for assistance and additional diagnostic steps if you are encountering this issue.

Root Cause

When the cluster is configured to use the redundant ring protocol (RRP) via altname entries, rgmanager starts the cpglockd daemon and uses it, instead of DLM, to process lock requests. In this case, one of rgmanager's threads appears to have become blocked waiting on a response from cpglockd while holding a mutex. Its other threads have backed up behind it, waiting on that mutex before they can request locks of their own, so nothing gets processed on that node.
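
For context, an RRP configuration of this kind is created by adding altname entries to /etc/cluster/cluster.conf. The fragment below is purely illustrative (node and interface names are examples, not taken from this case):

  <clusternodes>
    <clusternode name="node1" nodeid="1">
      <altname name="node1-alt"/>  <!-- name resolving to the second ring's interface -->
      <!-- fence configuration omitted -->
    </clusternode>
    <clusternode name="node2" nodeid="2">
      <altname name="node2-alt"/>
      <!-- fence configuration omitted -->
    </clusternode>
  </clusternodes>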

A core of cpglockd shows that it is waiting for fenced to confirm that a node which was removed from the cluster has been fenced. In this particular case, that node has subsequently rejoined the cluster and been forcefully removed several more times, yet has never been fenced. A review of the cluster dumps (see Diagnostic Steps below) shows that corosync never executed the cluster daemons' conf change callback functions, which is why the node was never fenced.

Normally, if a node rejoins the cluster after fencing has been initiated but has not yet completed, fenced "averts" the pending fencing action. cpglockd would then see that the node had been handled since it was first removed, recover its locks, and proceed as usual. In this case, because fenced never attempted to fence the victim in the first place, the node's rejoining does not register as a fence completion, and cpglockd continues to wait even after the node has rejoined.

Red Hat is continuing to investigate this issue. Please contact Red Hat Global Support Services if you believe your cluster has experienced this issue.

Diagnostic Steps

  • Run cpglockdump and review the output for signs of why cpglockd might not be responding. If cpglockdump does not return on one or more nodes, this issue may be relevant.
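
For example, the following can be run on each node (the output path is illustrative); if the command hangs on a node rather than printing the lock state, gather the diagnostics below from that node:

  cpglockdump > /tmp/cpglockdump.$(hostname -s).txt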

  • Gather a core of rgmanager with gcore and analyze it to determine why it is not responding. Look for several threads waiting on a mutex in cpg_lock, such as:

Thread 5 (Thread 0x7fd9ea0a1700 (LWP 52943)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x0000003b14c09388 in _L_lock_854 () from /lib64/libpthread-2.12.so
#2  0x0000003b14c09257 in __pthread_mutex_lock (mutex=0x15b7990) at pthread_mutex_lock.c:61
#3  0x0000000000423fdf in cpg_lock (handle=0x15b7990, resource=<value optimized out>, flags=<value optimized out>, lock=0x7fd9ea0a0a40) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/libcpglock.c:158
#4  0x0000000000423445 in _cpg_lock (mode=<value optimized out>, lksb=0x7fd9ea0a0c20, options=<value optimized out>, resource=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/cpg_lock.c:72
#5  0x0000000000412bf6 in rg_lock (name=<value optimized out>, p=0x7fd9ea0a0c20) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/rg_state.c:117
#6  0x00000000004069df in count_resource_groups (ml=0x7fd9bc0008c0) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/groups.c:108
#7  0x00000000004075a8 in eval_groups (local=1, nodeid=2, nodeStatus=1) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/groups.c:699
#8  0x0000000000411b31 in _event_thread_f (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/rg_event.c:449
#9  0x0000003b14c07851 in start_thread (arg=0x7fd9ea0a1700) at pthread_create.c:301
#10 0x0000003b148e894d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

If this is the case, then find the thread that is holding the mutex, such as:

Thread 7 (Thread 0x7fd9ea0606e0 (LWP 52953)):
#0  0x0000003b148e1533 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000000000424315 in select_retry (fdmax=9, rfds=0x7fd9ea05f8d0, wfds=0x0, xfds=0x7fd9ea05f850, timeout=0x7fd9ea05f9c0) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/sock.c:65
#2  0x00000000004247e1 in read_retry (sockfd=8, buf=0x7fd9ea05fa50, count=128, timeout=0x7fd9ea05f9c0) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/sock.c:175
#3  0x0000000000423f28 in cpg_lock (handle=0x15b7990, resource=<value optimized out>, flags=<value optimized out>, lock=0x7fd9ea05fba0) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/libcpglock.c:137
#4  0x0000000000423445 in _cpg_lock (mode=<value optimized out>, lksb=0x7fd9ea05fd90, options=<value optimized out>, resource=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/cpg_lock.c:72
#5  0x0000000000412bf6 in rg_lock (name=<value optimized out>, p=0x7fd9ea05fd90) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/rg_state.c:117
#6  0x000000000040629b in q_status_checks (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/groups.c:1355
#7  0x0000003b14c07851 in start_thread (arg=0x7fd9ea0606e0) at pthread_create.c:301
#8  0x0000003b148e894d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

In this situation, multiple threads are waiting to request a CPG lock, but must first acquire the mutex. The thread that holds it is waiting in select() on the cpglockd file descriptor for a response. Once that select times out, it releases the mutex and sleeps, which allows another thread to take the mutex, request a lock, time out, release the mutex, and sleep in turn, so the threads simply bounce around like this indefinitely. For this reason, we must determine what cpglockd is doing and why it is not responding before the issue can be resolved.
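
As a rough sketch, the core can be captured and inspected along the following lines. The paths and the <pid> suffix are illustrative, the mutex address is taken from the backtraces above, and rgmanager/glibc debuginfo packages are assumed to be installed so the symbols resolve:

  # Install symbols (requires yum-utils), then capture the running process:
  debuginfo-install -y rgmanager glibc
  gcore -o /tmp/rgmanager.core $(pidof rgmanager)

  # Open the core and dump every thread's backtrace:
  gdb /usr/sbin/rgmanager /tmp/rgmanager.core.<pid>
  (gdb) thread apply all bt

  # The holder of the contended mutex can then be identified directly;
  # __owner is the LWP (thread ID) of the thread currently holding it:
  (gdb) p ((pthread_mutex_t *) 0x15b7990)->__data.__owner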

  • Capture a core of cpglockd on the node where cpglockdump is not returning and review the state. Look for the main thread sleeping:
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fc0b3d1d700 (LWP 12905))]#0  0x0000003fbf2acbcd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0  0x0000003fbf2acbcd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003fbf2e1da4 in usleep (useconds=<value optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:33
#2  0x00000000004040a1 in main (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/cpglockd.c:1710

This indicates that cpglockd received a state change and is waiting for confirmation from cman/fenced that the node has been fenced, sleeping for 0.5 seconds between checks. We can see what time the node failed and compare it to the logs:

  (gdb) p pending_fencing
  $1 = (struct pending_fence_node *) 0xb2e590
  (gdb) p *pending_fencing
  $2 = {_list_head = {le_next = 0xb2e5c0, le_prev = 0xb2e650}, nodeid = 2, force_wait = 1, fail_time = 1382733667}
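
The fail_time value is a Unix epoch timestamp, so it can be converted to a human-readable time for comparison against the logs, for example:

  $ date -u -d @1382733667
  Fri Oct 25 20:41:07 UTC 2013
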
  • Review the fence_tool dump, gfs_control dump, and dlm_tool dump output to see whether a configuration change is missing after the node removal at the pending_fencing fail_time (1382733667 in the example above); commands to capture these dumps are shown after the examples below. In fence_tool dump, look for an add_change cg <id> remove nodeid <id> reason <reason> line. In this example it is missing:
  1382733667 cluster node 2 removed seq 1004
  1382733667 fenced:daemon conf 1 0 1 memb 1 join left 2
  1382733667 fenced:default ring 1:1004 1 memb 1
  1382733667 fenced:daemon ring 1:1004 1 memb 1
  1382734258 cluster node 2 added seq 1008  <-- rejoin 9 minutes later

Compare this to a case where the conf change did appear:

  1383671248 cluster node 2 removed seq 12376
  1383671248 fenced:daemon conf 1 0 1 memb 1 join left 2
  1383671248 fenced:daemon ring 1:12376 1 memb 1
  1383671248 fenced:default conf 1 0 1 memb 1 join left 2
  1383671248 add_change cg 2 remove nodeid 2 reason 3
  1383671248 add_change cg 2 m 1 j 0 r 1 f 1
  1383671248 add_victims node 2
  [...]

Or, in gfs_control dump, the gfs:controld conf line is missing:

  1382733667 cluster node 2 removed seq 1004
  1382734258 cluster node 2 added seq 1008    <-- rejoin 9 minutes later

instead of:

  1383671248 cluster node 2 removed seq 12376
  1383671248 gfs:controld conf 1 0 1 memb 1 join left 2

And in dlm_tool dump, the dlm:controld conf line is missing:

  1382733667 cluster node 2 removed seq 1004
  1382733667 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/2"
  1382734258 cluster node 2 added seq 1008      <-- rejoin 9 minutes later

instead of:

  1383671248 cluster node 2 removed seq 12376
  1383671248 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/2"
  1383671248 dlm:controld conf 1 0 1 memb 1 join left 2

NOTE: The fact that gfs_controld and dlm_controld did not receive the conf changes has no impact on cpglockd or rgmanager; it is simply further evidence that corosync did not execute those conf change callbacks.
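
The dumps referenced above can be captured on the affected node with commands along these lines (the output file names are illustrative):

  fence_tool dump  > /tmp/fence_tool_dump.$(hostname -s).txt
  gfs_control dump > /tmp/gfs_control_dump.$(hostname -s).txt
  dlm_tool dump    > /tmp/dlm_tool_dump.$(hostname -s).txt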

  • Capture a core of corosync with gcore $(pidof corosync) on all nodes.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
