After a node is stopped and started without a reboot, there are 'dm-log-userspace: request timed out' errors and mirror devices hang in a RHEL 6 Resilient Storage cluster
Issue
- When selecting "Leave cluster" in the
luciweb interface, then selecting "Join cluster" to rejoin, the cycled node starts getting retry errors fromdm-log-userspaceand I/O to mirrored devices blocks throughout the cluster, and rebooting the node that just left and joined does not help.
device-mapper: dm-log-userspace: [LNnCNzH2] Request timed out: [5/65] - retrying
- If I stop the cluster services on one node but leave
cmirrordrunning, then start the cluster services back up,cmirrordprints a debug dump and then accessing mirror devices hangs
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [Scxvd5CN] Failed to open checkpoint: SA_AIS_ERR_LIBRARY
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [Scxvd5CN] Failed to import checkpoint from 2
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: LOG COMPONENT DEBUGGING::
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Official log list:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Pending log list:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: LVM-YAG4Jxj43bpMLmOqgQH2E8q8wUe8363lvUFosnDlhajUUDO7sJv1dkioScxvd5CN
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: recoverer : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: recovering_region: 18446744073709551615
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: recovery_halted : NO
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: sync_bits:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 0 - 15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 16 - 31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 32 - 47] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 48 - 63] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 64 - 79] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 80 - 95] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 96 - 111] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [112 - 127] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [128 - 143] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [144 - 159] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [160 - 175] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [176 - 191] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [192 - 207] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [208 - 223] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [224 - 239] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [240 - 255] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: clean_bits:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 0 - 15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 16 - 31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 32 - 47] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 48 - 63] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 64 - 79] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 80 - 95] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 96 - 111] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [112 - 127] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [128 - 143] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [144 - 159] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [160 - 175] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [176 - 191] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [192 - 207] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [208 - 223] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [224 - 239] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [240 - 255] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Validating Scxvd5CN::
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: lc->region_count = 2040
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: lc->sync_count = 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: next zero bit = 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Resync request history:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: CLUSTER COMPONENT DEBUGGING::
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Scxvd5CN::
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: lowest_id : 1
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: state : INVALID
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: cpg_state : 1
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: free_me : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: delay : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: resend_requests : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: checkpoints_needed: 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: CKPTs waiting : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Working list:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Startup list:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: DM_ULOG_RESUME/983
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Command History:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 95:1) +++ UUID=Scxvd5CN 1 join +++
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 96:2) SEQ#=983, UUID=Scxvd5CN, TYPE=DM_ULOG_RESUME, ORIG=1, RESP=NO
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 97:3) [Scxvd5CN] CHECKPOINT_READY notification from 2
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 98:4) [Scxvd5CN] Failed to import checkpoint from 2
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 99:5) SEQ#=2, UUID=Scxvd5CN, TYPE=DM_ULOG_CHECKPOINT_READY, ORIG=1, RESP=NO
Apr 27 10:22:04 rhel6-node1 kernel: device-mapper: dm-log-userspace: [Scxvd5CN] Request timed out: [5/983] - retrying
Apr 27 10:22:19 rhel6-node1 kernel: device-mapper: dm-log-userspace: [Scxvd5CN] Request timed out: [5/985] - retrying
Environment
- Red Hat Enterprise Linux (RHEL) 6 with the Resilient Storage Add On
- Red Hat Enterprise Linux (RHEL) 6 with the Resilient Storage Add On
lvm2-clusterlocking_type = 3in/etc/lvm/lvm.conf- One or more volume group with the clustered attribute set
cmirror- One or more mirrored logical volumes in a clustered volume group
Subscriber exclusive content
A Red Hat subscription provides unlimited access to our knowledgebase of over 48,000 articles and solutions.
Welcome! Check out the Getting Started with Red Hat page for quick tours and guides for common tasks.
