Ceph: The Ceph-MGR may leave many connections in CLOSE-WAIT status, causing a File Descriptor (FD) leak.
Issue
The Ceph-MGR may leave many connections in CLOSE-WAIT status, resulting in a File Descriptor (FD) leak.
This issue was reported to Red Hat as: "The Restful API goes unresponsive until active MGR is failed over or restarted".
1.) Investigation of the issue found thousands of "CLOSE-WAIT" connections associated with the Ceph MGR:
# ss -peaonmi | grep "CLOSE-WAIT"
[....]
tcp CLOSE-WAIT 328 0 10.68.31.41:8443 10.68.31.3:43428 users:(("ceph-mgr",pid=6591,fd=7051)) uid:167 ino:100729770 sk:28e47 -->
tcp CLOSE-WAIT 328 0 10.68.31.41:8443 10.68.31.4:33658 users:(("ceph-mgr",pid=6591,fd=26962)) uid:167 ino:109079723 sk:293d2 -->
tcp CLOSE-WAIT 328 0 10.68.31.41:8443 10.68.31.3:54855 users:(("ceph-mgr",pid=6591,fd=22457)) uid:167 ino:106448541 sk:28e48 -->
tcp CLOSE-WAIT 328 0 10.68.31.41:8443 10.68.31.3:38439 users:(("ceph-mgr",pid=6591,fd=1991)) uid:167 ino:99138254 sk:28e49 -->
[....]
# ss -peaonmi | grep "CLOSE-WAIT" -c
10900
With this many connections left open, the Ceph MGR exhausts its file descriptor limit and is unable to service any new connection requests.
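To confirm that the file descriptor limit is the bottleneck, the number of descriptors held by the ceph-mgr process can be compared against its limit. This is a generic check; the PID (6591) is taken from the ss output above and will differ on each system:
# ls /proc/6591/fd | wc -l
# grep "open files" /proc/6591/limits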
2.) Most of the connections were from the two F5 Load Balancer IP addresses [1]:
# ss -peaonmi | grep ":8443" | grep "CLOSE-WAIT" | awk '{print $6}' | awk -F: '{print $1}' | sort | uniq -c
114 10.xx.yy.126
5374 10.xx.yy.3 [1]
5399 10.xx.yy.4 [1]
13 10.xx.yy.41
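To check whether the leak is still growing, the CLOSE-WAIT count can be sampled over time with a simple loop built from the same ss command used above (the 60-second interval is arbitrary):
# while true; do date; ss -peaonmi | grep ":8443" | grep -c "CLOSE-WAIT"; sleep 60; done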
3.) DEBUG logging was enabled for the Ceph MGR and the Ceph MGR Dashboard module:
# ceph config set mgr debug_mgr 20
# ceph config set mgr mgr/dashboard/log_level debug
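With debug logging enabled, MGR and dashboard activity is written to the active MGR's log file (typically /var/log/ceph/ceph-mgr.<hostname>.log on the node running the active MGR). The settings can be verified, and reverted once the data has been collected, for example:
# ceph config get mgr debug_mgr
# ceph config rm mgr debug_mgr
# ceph config rm mgr mgr/dashboard/log_level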
4.) The following article details how to diagnose "CLOSE-WAIT" states for any product/service: Large number of CLOSE_WAIT sockets seen in "netstat" or "ss".
Environment
Red Hat Ceph Storage (RHCS), all versions