RHGS - VMs in RHV fail to start with error "Failed to get "write" lock" due to a stale gluster lock


Environment

  • Red Hat Gluster Storage 3.x
  • Red Hat Virtualization 4.x

Issue

  • A VM named "test-db" is unable to start, showing the following error:
VM test-db is down with error. Exit message: internal error: qemu unexpectedly closed the monitor: 2021-12-27T10:12:45.871283Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2021-12-27T10:12:46.062384Z qemu-kvm: -device scsi-hd,bus=ua-dacb155e-c783-4ca5-a685-ec8e8444df91.0,channel=0,scsi-id=0,lun=3,drive=drive-ua-cec91477-ed93-4e99-8daa-9c4899cb3cc1,id=ua-cec91477-ed93-4e99-8daa-9c4899cb3cc1,write-cache=on: Failed to get "write" lock
Is another process using the image [/rhev/data-center/mnt/gluster/gluster-1:_Volume1/73451579-82e1-4f3b-bb1c-68c8d3182fe9/images/cec91477-ed93-4e99-8daa-9c4899cb3cc1/80b0b164-2ea4-478b-a4cd-a9f76c1e6efd]?.
  • Locate the image file on the gluster volume mount point (a quick check from the RHV host is sketched after this list).
    In this example, the gluster volume name is Volume1 and it is mounted on /rhev/data-center/mnt/gluster/gluster-1:_Volume1/. The VM "test-db" image file is /73451579-82e1-4f3b-bb1c-68c8d3182fe9/images/cec91477-ed93-4e99-8daa-9c4899cb3cc1/80b0b164-2ea4-478b-a4cd-a9f76c1e6efd

  • The output of mount -l on node "gluster-1" shows the mount point hosting the VM "test-db" image file "/73451579-82e1-4f3b-bb1c-68c8d3182fe9/images/cec91477-ed93-4e99-8daa-9c4899cb3cc1/80b0b164-2ea4-478b-a4cd-a9f76c1e6efd":

gluster-1:/Volume1 on /rhev/data-center/mnt/glusterSD/gluster-1:_Volume1 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

  • The gluster nodes are running RHGS 3.5.2 (glusterfs-6.0-37.el7rhgs.x86_64)
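
As a quick sanity check, the mount and the image file can be confirmed from the RHV host before touching any locks; a minimal sketch using the paths from the error message above (note the mount point appears under mnt/glusterSD in the mount output):

# mount -l | grep Volume1
# ls -lh /rhev/data-center/mnt/glusterSD/gluster-1:_Volume1/73451579-82e1-4f3b-bb1c-68c8d3182fe9/images/cec91477-ed93-4e99-8daa-9c4899cb3cc1/80b0b164-2ea4-478b-a4cd-a9f76c1e6efd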

Resolution

  • Stop the VM and power it off (if not already done)
  • Then, from one of the gluster nodes run:
# gluster volume clear-locks VOLNAME /full_path_to_the_image_file kind granted inode 0,0-0

For example, for the error seen above, run from gluster node "gluster-1" (a verification sketch follows this list):
# gluster volume clear-locks Volume1 /73451579-82e1-4f3b-bb1c-68c8d3182fe9/images/cec91477-ed93-4e99-8daa-9c4899cb3cc1/80b0b164-2ea4-478b-a4cd-a9f76c1e6efd kind granted inode 0,0-0
  • Then start the VM from the RHV UI

  • To avoid this issue, upgrade to RHGS 3.5.5 or later, which fixes the underlying lock issues; see the Root Cause section below.
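
Before starting the VM again, one can optionally verify that the stale lock is gone by taking a fresh statedump and searching it for the image file; a minimal sketch, assuming the default /var/run/gluster statedump directory (see Diagnostic Steps below), run on the gluster nodes:

# gluster volume statedump Volume1
# grep -A 6 '80b0b164-2ea4-478b-a4cd-a9f76c1e6efd' /var/run/gluster/*.dump.*

If the clear-locks command succeeded, no ACTIVE inodelk entry should remain for that path.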

Root Cause

  • On workloads that issue many fsyncs against a gluster replicated volume with sharding enabled (typical for VM images), it has been observed that many inodelks can remain blocked, waiting on a granted inodelk that is released only after a long time, or never.

  • There are several bugs that could explain this issue:

Bug 1943467 - AFR does not release eager locks for fsync when there is contention
Fixed in glusterfs-6.0-57 (RHGS 3.5.5 includes glusterfs-6.0-59)

Bug 1878179 - [GSS][Gluster 3.5][Hundreds of locks waiting for active locks granted months ago]
Closed as duplicate of bug 1915037

Bug 1915037 - [GSS] Paused VM due to stale locks
Root cause not completely identified, but the issue should be minimized by the fixes for Bug 1943467 above

Diagnostic Steps

  • Collect a statedump of the gluster volume to check whether a lock exists on the image file. Run from any gluster node:
# gluster volume statedump VOLNAME

The output is saved in the /var/run/gluster directory by default. Output files are named according to the following conventions:

For brick processes, brick_path.brick_pid.dump.timestamp.
For volume processes and kill command results, glusterdump-glusterd_pid.dump.timestamp.
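
To locate the dump files that were just generated, it is usually enough to list the statedump directory sorted by modification time; a minimal sketch, assuming the default /var/run/gluster location:

# ls -lt /var/run/gluster/ | head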

Example, for the following gluster volume:

# gluster volume info Volume1

Volume Name: Volume1
Type: Replicate
Volume ID: c7e41b5d-d0ac-4765-90e0-cbf5fc19c963
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster-0:/mnt/data2/2      (Volume1-client-0)
Brick2: gluster-1:/mnt/data2/2       (Volume1-client-1)
Brick3: gluster-2:/mnt/data2/2       (Volume1-client-2)

with the following brick processes:

# gluster v status Volume1
Status of volume: Volume1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster-0:/mnt/data2/2                49156     0          Y       2714
Brick gluster-1:/mnt/data2/2                49156     0          Y       2594
Brick gluster-2:/mnt/data2/2                49155     0          Y       2494

On node gluster-0, the dump file of the brick process is:
mnt-data2-2.2714.dump.1642444970

On node gluster-1, the dump file of the brick process is:
mnt-data2-2.2594.dump.1642444971

On node gluster-2, the dump file of the brick process is:
mnt-data2-2.2494.dump.1642444971

See the Red Hat Gluster Storage Administration Guide for a complete explanation of statedumps.
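
With the dump file names known, the lock section for the affected image file can be extracted directly; a minimal sketch, run on each gluster node against the dump files named above (the grep pattern is the image directory UUID from the error message):

# grep -B 2 -A 10 'cec91477-ed93-4e99-8daa-9c4899cb3cc1' /var/run/gluster/mnt-data2-2.*.dump.*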

  • In one of the dump files, a lock can be seen on the same image file indicated in the error message:
[xlator.features.locks.Volume1-locks.inode]
path=/73451579-82e1-4f3b-bb1c-68c8d3182fe9/images/cec91477-ed93-4e99-8daa-9c4899cb3cc1/80b0b164-2ea4-478b-a4cd-a9f76c1e6efd
mandatory=0
inodelk-count=170
lock-dump.domain.domain=Volume1-replicate-0:self-heal
lock-dump.domain.domain=Volume1-replicate-0
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=faffffff, client=0x7f956006f720,
connection-id=CTX_ID:4b587205-8315-40b7-a6fe-093a3c687e3e-GRAPH_ID:0-PID:20233-HOST:gluster-1-PC_NAME:Volume1-client-1-RECON_NO:-0, blocked at 2021-12-16 17:48:34, granted at 2021-10-16 13:51:15
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 39139, owner=382002345e7f0000, client=0x7f956809acb0, connection-id=CTX_ID:8e11de8f-2457-49c5-8be1-54aab86aac8-GRAPH_ID:0-PID:4998-HOST:gluster-2-PC_NAME:Volume1-client-2-RECON_NO:-2, blocked at 2021-10-06 10:17:10
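
To gauge how many inodelks are queued behind the stale ACTIVE lock, the BLOCKED entries in the dump can be counted; a minimal sketch against the gluster-1 dump file from the example above (note this counts BLOCKED entries across the whole dump, not only this inode):

# grep -c '(BLOCKED)' /var/run/gluster/mnt-data2-2.2594.dump.1642444971

With inodelk-count=170 and a single ACTIVE lock on this inode, a number close to 169 is expected here.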

In addition, errors like the following can be seen in the fuse client log "rhev-data-center-mnt-glusterSD-gluster-1:_Volume1.log":

[2021-12-27 03:12:45.366724] E [rpc-clnt.c:183:call_bail] 0-Volume1-client-2: bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0xf02b051b, unique = 38920433763, sent = 2021-12-27 02:42:40.949088, timeout = 1800 for 10.0.0.5:49155
[2021-12-27 03:12:45.366758] E [MSGID: 114031] [client-rpc-fops_v2.c:1345:client4_0_inodelk_cbk] 0-Volume1-client-2: remote operation failed [Transport endpoint is not connected]
[2021-12-27 03:12:45.367267] E [MSGID: 114031] [client-rpc-fops_v2.c:1345:client4_0_inodelk_cbk] 0-Volume1-client-1: remote operation failed [Invalid argument]

An inode lock operation times out after 30 minutes (1800 seconds), matching the timeout = 1800 in the call_bail message above.
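
On the RHV host, the same bail-out pattern can be found quickly by searching the fuse client log; a minimal sketch, assuming the log resides under /var/log/glusterfs/ as is usual for gluster fuse mounts:

# grep -E 'call_bail|inodelk' /var/log/glusterfs/rhev-data-center-mnt-glusterSD-gluster-1:_Volume1.log | tail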
