Application hanging due to BLOCKED locks

Solution Verified - Updated -

Issue

If a directory contains a very large number of files (for example, hundreds of thousands or millions), listing its contents takes a long time. The self-heal daemon in RHGS has to read the contents of the directory from every replica copy (that is, from each brick in the replica group) in order to compare the directory contents across bricks. Once the read is complete, the self-heal daemon decides which operation to perform next. This decision is based on the directory comparison and the changelog extended attributes of the replicate volume.

How does the self-heal daemon safely read the entire contents of such a large directory? The directory contents must not change while healing is in progress, so the self-heal daemon locks the directory while it works. This lock is called ENTRYLK. The self-heal daemon may also perform a conservative merge of the directory contents from all bricks of the replica group.
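The idea of a conservative merge can be illustrated with a small model. The sketch below is not Gluster's implementation. It assumes that a conservative merge amounts to taking the union of the entry names seen on all bricks, so that no entry is dropped, and it represents each brick's listing as a plain set of names. The function name conservative_merge and the brick and file names are hypothetical.

```python
from typing import Dict, Set


def conservative_merge(listings: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each brick, return the entry names it lacks relative to the union.

    `listings` maps a (hypothetical) brick name to the set of entry names
    read from the directory on that brick.
    """
    # The merged view keeps every name seen on any brick; nothing is deleted.
    merged: Set[str] = set().union(*listings.values())
    # Each brick would need the names it does not yet have.
    return {brick: merged - names for brick, names in listings.items()}


# Toy data: two bricks of one replica group whose listings have diverged.
listings = {
    "brick-a": {"f1", "f2"},
    "brick-b": {"f2", "f3"},
}
missing = conservative_merge(listings)
print(missing)
```

In this model the whole listing from every brick has to be read before the merged view is known, which is why the directory stays locked for the duration of the read.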

Potential complications:

1) If the directory contains a large number of files, holding the lock on the directory while healing can block application I/O.

2) If the directory contents differ because an entry (file or directory) has the same name but different gfids on different bricks, the self-heal daemon cannot perform the conservative merge of the entries from those bricks. The gfid is the unique identifier of an entry in Gluster, so two entries with the same name but different gfids cannot be treated as the same entry. In this case the self-heal daemon leaves the directory as it is, unhealed.
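The gfid mismatch in the second complication can be sketched as a simple check. This is an illustrative model only, not how the self-heal daemon is implemented: it assumes each brick's directory listing is available as a mapping from entry name to gfid string. The function name find_gfid_mismatches, the brick names, and the gfid values are made-up placeholders.

```python
from typing import Dict, Set


def find_gfid_mismatches(
    listings: Dict[str, Dict[str, str]]
) -> Dict[str, Set[str]]:
    """Return entry names that carry more than one gfid across bricks.

    `listings` maps a (hypothetical) brick name to a mapping of
    entry name -> gfid as seen on that brick.
    """
    seen: Dict[str, Set[str]] = {}
    for entries in listings.values():
        for name, gfid in entries.items():
            # Collect every distinct gfid observed for this name.
            seen.setdefault(name, set()).add(gfid)
    # A name with two or more distinct gfids cannot be merged safely.
    return {name: gfids for name, gfids in seen.items() if len(gfids) > 1}


# Toy data: "report.txt" has the same name but different gfids on two bricks.
listings = {
    "brick-a": {"report.txt": "gfid-0001", "data.bin": "gfid-0002"},
    "brick-b": {"report.txt": "gfid-9999", "data.bin": "gfid-0002"},
}
print(find_gfid_mismatches(listings))
```

An entry missing from one brick is not reported by this check, because it has only one gfid; only names with conflicting gfids are flagged.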

When these two complications occur together, the directory is never healed, so the self-heal daemon keeps retrying and keeps holding the lock. Applications then face a long delay on every entry operation (such as create, mkdir, or delete) within that directory. The delay is proportional to the number of entries present in the directory being healed. If millions of entries are present, applications might take several minutes to complete their entry operations.

Environment

  • RHGS-3.1
  • RHS-3.0
