After a node is stopped and started without a reboot, there are 'dm-log-userspace: request timed out' errors and mirror devices hang in a RHEL 6 Resilient Storage cluster

Solution Unverified - Updated -

Issue

  • When selecting "Leave cluster" in the luci web interface, then selecting "Join cluster" to rejoin, the cycled node starts getting retry errors from dm-log-userspace and I/O to mirrored devices blocks throughout the cluster, and rebooting the node that just left and joined does not help.
device-mapper: dm-log-userspace: [LNnCNzH2] Request timed out: [5/65] - retrying
  • If I stop the cluster services on one node but leave cmirrord running, then start the cluster services back up, cmirrord prints a debug dump and then accessing mirror devices hangs
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [Scxvd5CN] Failed to open checkpoint: SA_AIS_ERR_LIBRARY
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [Scxvd5CN] Failed to import checkpoint from 2
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: LOG COMPONENT DEBUGGING::
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Official log list:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Pending log list:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: LVM-YAG4Jxj43bpMLmOqgQH2E8q8wUe8363lvUFosnDlhajUUDO7sJv1dkioScxvd5CN
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   recoverer        : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   recovering_region: 18446744073709551615
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   recovery_halted  : NO
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: sync_bits:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [  0 -  15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 16 -  31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 32 -  47] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 48 -  63] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 64 -  79] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 80 -  95] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 96 - 111] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [112 - 127] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [128 - 143] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [144 - 159] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [160 - 175] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [176 - 191] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [192 - 207] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [208 - 223] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [224 - 239] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [240 - 255] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: clean_bits:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [  0 -  15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 16 -  31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 32 -  47] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 48 -  63] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 64 -  79] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 80 -  95] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 96 - 111] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [112 - 127] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [128 - 143] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [144 - 159] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [160 - 175] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [176 - 191] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [192 - 207] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [208 - 223] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [224 - 239] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [240 - 255] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Validating Scxvd5CN::
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   lc->region_count = 2040
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   lc->sync_count = 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   next zero bit  = 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Resync request history:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: CLUSTER COMPONENT DEBUGGING::
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Scxvd5CN::
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   lowest_id         : 1
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   state             : INVALID
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   cpg_state         : 1
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   free_me           : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   delay             : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   resend_requests   : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   checkpoints_needed: 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   CKPTs waiting     : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   Working list:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   Startup list:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:   DM_ULOG_RESUME/983
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Command History:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 95:1) +++  UUID=Scxvd5CN  1 join  +++
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 96:2) SEQ#=983, UUID=Scxvd5CN, TYPE=DM_ULOG_RESUME, ORIG=1, RESP=NO
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 97:3) [Scxvd5CN] CHECKPOINT_READY notification from 2
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 98:4) [Scxvd5CN] Failed to import checkpoint from 2
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 99:5) SEQ#=2, UUID=Scxvd5CN, TYPE=DM_ULOG_CHECKPOINT_READY, ORIG=1, RESP=NO
Apr 27 10:22:04 rhel6-node1 kernel: device-mapper: dm-log-userspace: [Scxvd5CN] Request timed out: [5/983] - retrying
Apr 27 10:22:19 rhel6-node1 kernel: device-mapper: dm-log-userspace: [Scxvd5CN] Request timed out: [5/985] - retrying

Environment

  • Red Hat Enterprise Linux (RHEL) 6 with the Resilient Storage Add On
  • Red Hat Enterprise Linux (RHEL) 6 with the Resilient Storage Add On
  • lvm2-cluster
    • locking_type = 3 in /etc/lvm/lvm.conf
    • One or more volume group with the clustered attribute set
  • cmirror
    • One or more mirrored logical volumes in a clustered volume group

Subscriber exclusive content

A Red Hat subscription provides unlimited access to our knowledgebase, tools, and much more.

Current Customers and Partners

Log in for full access

Log In

New to Red Hat?

Learn more about Red Hat subscriptions

Using a Red Hat product through a public cloud?

How to access this content