Ceph: RGW fails S3 transactions with HTTP 503 responses

Issue

  • RGW fails S3 transactions with HTTP 503 responses
  • From the RGW logs:
2020-06-23 09:11:59.661 7fe7087d7700 10 req 118010 0.000s s3:list_buckets scheduling with dmclock client=3 cost=1
2020-06-23 09:11:59.661 7fe7087d7700  0 req 118010 0.000s s3:list_buckets Scheduling request failed with -2218  // #define ERR_RATE_LIMITED        2218

Nov 07 19:45:04 data-xx-08 ceph-fcd6677e-xx-yy-zz-e0d55e53cea4-rgw-ssl-data-xx-08-sxdklt[14380]: 2022-11-07T19:45:04.347+0000 7f26afa26700  1 beast: 0x7f268655c600: 172.31.100.2 - - [07/Nov/2022:19:45:04.346 +0000] "GET /data/804/6a558494-xx-yy-zz-9297eb9bfeb4/d/06883/794 HTTP/1.1" 503 185 - latency=0.001000011s

Nov 07 19:45:04 data-xx-08 ceph-fcd6677e-xx-yy-zz-e0d55e53cea4-rgw-ssl-data-xx-08-sxdklt[14380]: 2022-11-07T19:45:04.487+0000 7f26afa26700  1 beast: 0x7f268655c600: 172.31.100.2 - - [07/Nov/2022:19:45:04.486 +0000] "GET /data/346/6a558494-xx-yy-zz-9297eb9bfeb4/d/23834/713 HTTP/1.1" 503 185 - latency=0.001000011s

Nov 07 19:45:04 data-xx-08 ceph-fcd6677e-xx-yy-zz-e0d55e53cea4-rgw-ssl-data-xx-08-sxdklt[14380]: 2022-11-07T19:45:04.588+0000 7f26afa26700  1 beast: 0x7f268655c600: 172.31.100.2 - - [07/Nov/2022:19:45:04.587 +0000] "GET /data/804/6a558494-xx-yy-zz-9297eb9bfeb4/d/06883/794 HTTP/1.1" 503 185 - latency=0.000000000s
  • An excessive number of connections to the RGW are stuck in CLOSE-WAIT:
[data-xx-08 ~]# ss -anp | grep radosgw | grep "CLOSE-WAIT" | head
tcp   CLOSE-WAIT 25     0                                                                    172.31.100.168:444                   172.31.100.20:48100    users:(("radosgw",pid=2406701,fd=1645))
tcp   CLOSE-WAIT 25     0                                                                    172.31.100.168:444                    172.31.100.1:33928    users:(("radosgw",pid=2406701,fd=1292))
tcp   CLOSE-WAIT 25     0                                                                    172.31.100.168:444                   172.31.100.20:34637    users:(("radosgw",pid=2406701,fd=1421))
tcp   CLOSE-WAIT 25     0                                                                    172.31.100.168:444                   172.31.100.20:33513    users:(("radosgw",pid=2406701,fd=1206))
tcp   CLOSE-WAIT 25     0                                                                    172.31.100.168:444                   172.31.100.20:47564    users:(("radosgw",pid=2406701,fd=2512))
tcp   CLOSE-WAIT 25     0                                                                    172.31.100.168:444                   172.31.100.20:44240    users:(("radosgw",pid=2406701,fd=1715))
tcp   CLOSE-WAIT 25     0                                                                    172.31.100.168:444                    172.31.100.1:40585    users:(("radosgw",pid=2406701,fd=1699))
tcp   CLOSE-WAIT 25     0                                                                    172.31.100.168:444                   172.31.100.20:45868    users:(("radosgw",pid=2406701,fd=1175))
tcp   CLOSE-WAIT 25     0                                                                    172.31.100.168:444                   172.31.100.20:45845    users:(("radosgw",pid=2406701,fd=934))
tcp   CLOSE-WAIT 25     0                                                                    172.31.100.168:444                   172.31.100.20:54824    users:(("radosgw",pid=2406701,fd=2421))

[data-xx-08 ~]# ss -anp | grep radosgw | grep "CLOSE-WAIT" -c
1019
  • An HAProxy load balancer sits between the application server(s) and the Ceph RGWs (RADOS Gateways)
  • The HAProxy configuration does NOT set timeout client or option http-server-close
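To check whether the CLOSE-WAIT sockets counted above come from the load balancer, the same ss output can be grouped by peer address. The following is a minimal sketch, not part of the original article: the function name close_wait_by_peer is hypothetical, and it assumes the column layout shown in the ss output above, where the sixth field is the peer address:port.

```shell
# Tally CLOSE-WAIT sockets owned by radosgw, grouped by peer IP.
# Reads `ss -anp` style lines on stdin, so it also works on saved output.
# Assumption: field 2 is the socket state and field 6 is peer address:port,
# as in the ss output shown in this article.
close_wait_by_peer() {
    awk '$2 == "CLOSE-WAIT" && /radosgw/ {
        peer = $6
        sub(/:[0-9]+$/, "", peer)   # drop the trailing :port
        count[peer]++
    }
    END { for (p in count) print count[p], p }' | sort -rn
}
```

On an RGW node this could be run as `ss -anp | close_wait_by_peer`. Each output line is a count followed by a peer IP, highest count first. If the top entries are the load balancer addresses, that supports the suspicion that connections closed on the proxy side are not being closed by the RGW.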

Environment

Red Hat Ceph Storage (RHCS) 5.x
