How do I make my replicated volumes respond to requests faster when one of the nodes goes offline in Red Hat Storage 3.*?

Solution Verified - Updated -

Environment

  • Red Hat Storage 3.*

Issue

  • I have a replicated pair of Red Hat Storage 3.* servers configured for high availability. When one of the servers is rebooted while data is being copied, it takes about 1 minute for the copy to resume.
  • Why does the copy take about 1 minute to resume?
  • Is it possible to modify the 1 minute timeout behavior?

Resolution

  • Set the network.ping-timeout option to a value of 10 seconds with:
 # gluster volume set <volname> network.ping-timeout 10
  • Then remount the volume on the client side by running umount followed by mount.
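
The two steps above could be wrapped in small helper functions, roughly as sketched below. The function names, the DRY_RUN switch, and the example server, volume, and mount point are illustrative assumptions, and the remount assumes the volume is mounted with the native glusterfs (FUSE) client.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are illustrative, not part of gluster.

# usage: set_ping_timeout VOLNAME SECONDS     (run on a storage server)
# Set DRY_RUN=1 to print the command instead of running it.
set_ping_timeout() {
    volname=$1
    seconds=$2
    # Refuse anything that is not a whole number before calling gluster.
    case $seconds in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 1
            ;;
    esac
    ${DRY_RUN:+echo} gluster volume set "$volname" network.ping-timeout "$seconds"
}

# usage: remount_volume SERVER VOLNAME MOUNTPOINT     (run on the client)
# Assumes a native glusterfs (FUSE) mount.
remount_volume() {
    ${DRY_RUN:+echo} umount "$3" &&
    ${DRY_RUN:+echo} mount -t glusterfs "$1:/$2" "$3"
}
```

For example, with a hypothetical mount point /mnt/testvol, running set_ping_timeout testvol 10 on a server and then remount_volume rhs-1 testvol /mnt/testvol on the client would apply the change. Setting DRY_RUN=1 first prints the commands so they can be reviewed before anything is changed.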

Root Cause

  • The hang lasts 42 seconds by default because the default network.ping-timeout value is 42. This value controls how long the gluster client waits for a response from a server before it gives up and marks the connection as timed out.
  • Setting the value to 10 seconds should help the client recover faster. However, setting it too low can cause clients to disconnect when servers are under heavy load and not responding to requests in a timely manner.
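
To see how long the client actually stalls before and after changing the timeout, one rough approach is to write to the mounted volume once per second and look for the largest gap between successful writes. The sketch below assumes a hypothetical mount point and a log file on local disk; the function names and the probe file name are made up for illustration, and the result is only an approximation of the stall.

```shell
#!/bin/sh
# Sketch only: function names and the probe file name are illustrative.

# Print the largest gap, in seconds, between consecutive epoch
# timestamps read from stdin (one timestamp per line).
max_gap() {
    awk 'NR > 1 { gap = $1 - prev; if (gap > max) max = gap }
         { prev = $1 }
         END { print max + 0 }'
}

# usage: probe_writes MOUNTPOINT LOGFILE COUNT
# Writes a small file to the volume COUNT times, one second apart,
# and logs the time at which each write completed.
probe_writes() {
    i=0
    while [ "$i" -lt "$3" ]; do
        # This write blocks while the client waits for an unresponsive server.
        date +%s > "$1/.ping-probe" && date +%s >> "$2"
        sleep 1
        i=$((i + 1))
    done
}
```

A possible run would be probe_writes /mnt/testvol /tmp/probe.log 120 on the client while one server is rebooted, followed by max_gap < /tmp/probe.log. Keeping the log file off the gluster volume matters, since otherwise the logging itself would block as well.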

Diagnostic Steps

  • Issue the following command to view the current volume options (all lists every volume; a volume name can be given instead to show only that volume):
 # gluster volume info all

 Volume Name: testvol
 Type: Replicate
 Volume ID: 3d7b1203-149e-4456-9e5a-aaa42e3c42dc
 Status: Started
 Number of Bricks: 1 x 2 = 2
 Transport-type: tcp
 Bricks:
 Brick1: rhs-1:/gluster/brick1
 Brick2: rhs-2:/gluster/brick1
 Options Reconfigured:
 network.ping-timeout: 10 <---
  • Perform a data copy to the volume, restart one of the nodes to reproduce the hang.
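
When checking several volumes, the timeout can be pulled out of the command output instead of being read by eye. The following is a minimal sketch; the function name is an assumption, and it relies on the behavior shown above, where an option appears under "Options Reconfigured" only after it has been set, so a missing entry is reported as the default of 42.

```shell
#!/bin/sh
# Sketch only: the function name is illustrative.

# Read `gluster volume info <volname>` output on stdin and print the
# configured network.ping-timeout, or "42 (default)" if it is not listed.
ping_timeout_from_info() {
    awk -F': *' '
        $1 ~ /^ *network\.ping-timeout$/ {
            print $2 + 0      # numeric value only, ignoring trailing text
            found = 1
            exit
        }
        END { if (!found) print "42 (default)" }'
}
```

It would be used as gluster volume info testvol | ping_timeout_from_info. Passing the output for a single volume is the safer choice, because the function stops at the first match and does not distinguish between volumes.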

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
