After a node is stopped and started without a reboot, there are 'dm-log-userspace: request timed out' errors and mirror devices hang in a RHEL 6 Resilient Storage cluster
Issue
- When selecting "Leave cluster" in the
luci
web interface, then selecting "Join cluster" to rejoin, the cycled node starts getting retry errors fromdm-log-userspace
and I/O to mirrored devices blocks throughout the cluster, and rebooting the node that just left and joined does not help.
device-mapper: dm-log-userspace: [LNnCNzH2] Request timed out: [5/65] - retrying
- If I stop the cluster services on one node but leave
cmirrord
running, then start the cluster services back up,cmirrord
prints a debug dump and then accessing mirror devices hangs
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [Scxvd5CN] Failed to open checkpoint: SA_AIS_ERR_LIBRARY
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [Scxvd5CN] Failed to import checkpoint from 2
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: LOG COMPONENT DEBUGGING::
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Official log list:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Pending log list:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: LVM-YAG4Jxj43bpMLmOqgQH2E8q8wUe8363lvUFosnDlhajUUDO7sJv1dkioScxvd5CN
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: recoverer : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: recovering_region: 18446744073709551615
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: recovery_halted : NO
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: sync_bits:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 0 - 15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 16 - 31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 32 - 47] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 48 - 63] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 64 - 79] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 80 - 95] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 96 - 111] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [112 - 127] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [128 - 143] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [144 - 159] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [160 - 175] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [176 - 191] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [192 - 207] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [208 - 223] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [224 - 239] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [240 - 255] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: clean_bits:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 0 - 15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 16 - 31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 32 - 47] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 48 - 63] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 64 - 79] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 80 - 95] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [ 96 - 111] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [112 - 127] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [128 - 143] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [144 - 159] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [160 - 175] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [176 - 191] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [192 - 207] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [208 - 223] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [224 - 239] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: [240 - 255] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Validating Scxvd5CN::
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: lc->region_count = 2040
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: lc->sync_count = 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: next zero bit = 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Resync request history:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: CLUSTER COMPONENT DEBUGGING::
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Scxvd5CN::
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: lowest_id : 1
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: state : INVALID
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: cpg_state : 1
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: free_me : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: delay : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: resend_requests : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: checkpoints_needed: 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: CKPTs waiting : 0
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Working list:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Startup list:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: DM_ULOG_RESUME/983
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: Command History:
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 95:1) +++ UUID=Scxvd5CN 1 join +++
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 96:2) SEQ#=983, UUID=Scxvd5CN, TYPE=DM_ULOG_RESUME, ORIG=1, RESP=NO
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 97:3) [Scxvd5CN] CHECKPOINT_READY notification from 2
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 98:4) [Scxvd5CN] Failed to import checkpoint from 2
Apr 27 10:21:49 rhel6-node1 cmirrord[4858]: 99:5) SEQ#=2, UUID=Scxvd5CN, TYPE=DM_ULOG_CHECKPOINT_READY, ORIG=1, RESP=NO
Apr 27 10:22:04 rhel6-node1 kernel: device-mapper: dm-log-userspace: [Scxvd5CN] Request timed out: [5/983] - retrying
Apr 27 10:22:19 rhel6-node1 kernel: device-mapper: dm-log-userspace: [Scxvd5CN] Request timed out: [5/985] - retrying
Environment
- Red Hat Enterprise Linux (RHEL) 6 with the Resilient Storage Add On
- Red Hat Enterprise Linux (RHEL) 6 with the Resilient Storage Add On
lvm2-cluster
locking_type = 3
in/etc/lvm/lvm.conf
- One or more volume group with the clustered attribute set
cmirror
- One or more mirrored logical volumes in a clustered volume group
Subscriber exclusive content
A Red Hat subscription provides unlimited access to our knowledgebase, tools, and much more.