Ceph: The Ceph-MGR may leave many connections in CLOSE-WAIT status. File Descriptor (FD) Leak.

Solution Verified

Issue

The Ceph MGR may leave many connections in the CLOSE-WAIT state, leaking file descriptors (FDs).

This issue was reported to Red Hat as: "The Restful API goes unresponsive until active MGR is failed over or restarted".

1.) Investigation of the issue found thousands of "CLOSE-WAIT" connections associated with the Ceph MGR:

# ss -peaonmi | grep "CLOSE-WAIT"
[....]
tcp   CLOSE-WAIT 328    0                                                                                                       10.68.31.41:8443                  10.68.31.3:43428     users:(("ceph-mgr",pid=6591,fd=7051)) uid:167 ino:100729770 sk:28e47 -->
tcp   CLOSE-WAIT 328    0                                                                                                       10.68.31.41:8443                  10.68.31.4:33658     users:(("ceph-mgr",pid=6591,fd=26962)) uid:167 ino:109079723 sk:293d2 -->
tcp   CLOSE-WAIT 328    0                                                                                                       10.68.31.41:8443                  10.68.31.3:54855     users:(("ceph-mgr",pid=6591,fd=22457)) uid:167 ino:106448541 sk:28e48 -->
tcp   CLOSE-WAIT 328    0                                                                                                       10.68.31.41:8443                  10.68.31.3:38439     users:(("ceph-mgr",pid=6591,fd=1991)) uid:167 ino:99138254 sk:28e49 -->
[....]        

# ss -peaonmi | grep "CLOSE-WAIT" -c
10900

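Each CLOSE-WAIT socket still holds an open file descriptor in the ceph-mgr process (note the fd= values in the output above). With this many connections allocated, the Ceph MGR is unable to service any new connection requests.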
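To gauge how close the process is to exhausting its descriptors, the open FD count can be compared with the per-process "Max open files" limit. The following is a minimal sketch, assuming a Linux host with /proc mounted and enough privileges to read the ceph-mgr entries. The function names nofile_soft_limit and fd_usage are illustrative and are not part of any Ceph tooling.

```shell
#!/bin/sh
# Extract the soft "Max open files" limit from /proc/<pid>/limits
# content supplied on stdin (columns: name, soft, hard, units).
nofile_soft_limit() {
    awk '/^Max open files/ { print $4 }'
}

# Print the number of open file descriptors and the soft limit for a PID.
# Reading /proc/<pid>/fd normally requires root or the process owner.
fd_usage() {
    pid="$1"
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')
    limit=$(nofile_soft_limit < "/proc/$pid/limits")
    echo "pid=$pid open_fds=$open soft_limit=$limit"
}

# Usage (PID taken from the ss output, e.g. pid=6591):
#   fd_usage 6591
```

If open_fds approaches soft_limit, new connections are likely to be refused until the MGR is failed over or restarted.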

2.) Most of the connections were from the two F5 load balancer IP addresses [1]:

# ss -peaonmi | grep ":8443" | grep "CLOSE-WAIT" | awk '{print $6}' | awk -F: '{print $1}' | sort | uniq -c
     114 10.xx.yy.126
    5374 10.xx.yy.3   [1]
    5399 10.xx.yy.4   [1]
      13 10.xx.yy.41
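The per-peer count above can also be produced in a single awk pass that matches on the state and local port columns instead of chained grep calls. This is a sketch only: it assumes the column layout shown in the ss output above (netid, state, recv-q, send-q, local address, peer address), and the function name count_close_wait_by_peer is hypothetical.

```shell
#!/bin/sh
# Count CLOSE-WAIT sockets per peer address for one local port.
# Reads ss-style lines on stdin; prints "<count> <peer-ip>", largest first.
count_close_wait_by_peer() {
    port="$1"
    awk -v port="$port" '
        # Keep only CLOSE-WAIT rows whose local address ends in :<port>
        $2 == "CLOSE-WAIT" && $5 ~ (":" port "$") {
            peer = $6
            sub(/:[0-9]+$/, "", peer)   # drop the peer port, keep the address
            count[peer]++
        }
        END { for (p in count) printf "%d %s\n", count[p], p }
    ' | sort -rn
}

# Usage:
#   ss -peaonmi | count_close_wait_by_peer 8443
```

Matching on exact columns avoids counting lines where ":8443" appears as the peer port rather than the local port.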

3.) Enable DEBUG logging for the Ceph MGR and the Ceph MGR Dashboard module:

# ceph config set mgr debug_mgr 20
# ceph config set mgr mgr/dashboard/log_level debug
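Debug level 20 is verbose, so it is usually worth removing these overrides once the logs have been collected. The sketch below assumes the two options were set only by the commands above and that removing them with "ceph config rm" returns them to their defaults. The wrapper function reset_mgr_debug and the CEPH variable are illustrative additions for dry runs, not part of Ceph.

```shell
#!/bin/sh
# Remove the debug overrides set for the investigation.
# Set CEPH=echo to print the commands instead of running them.
reset_mgr_debug() {
    ceph_cmd="${CEPH:-ceph}"
    "$ceph_cmd" config rm mgr debug_mgr
    "$ceph_cmd" config rm mgr mgr/dashboard/log_level
}

# Usage:
#   CEPH=echo reset_mgr_debug   # dry run
#   reset_mgr_debug             # apply on a cluster node
```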

4.) This article details how to diagnose "CLOSE-WAIT" for any product/service: Large number of CLOSE_WAIT sockets seen in "netstat" or "ss".

Environment

Red Hat Ceph Storage (RHCS) all versions
