OpenShift APIService slow: ResponseCode: 429 Too many requests

Environment

  • Red Hat OpenShift Container Platform (RHOCP) 4
  • Red Hat OpenShift Service on AWS (ROSA) 4
  • Red Hat OpenShift Dedicated (OSD) 4
  • Azure Red Hat OpenShift (ARO) 4

Issue

  • OpenShift CLI (oc) commands take longer than expected to return output.
  • There are ResponseCode: 429 Too many requests errors in the kube-apiserver log:

    controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 429, Body: Too many requests, please try again later.
    
  • The following errors appear in the prometheus-adapter pod logs:

    timeout.go:135] post-timeout activity - time-elapsed
    
  • Multiple namespaces starting with openshift-build-test- are stuck in the Terminating state.
  • The oc adm top pod command fails with the following timeout error:

    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)
    

Resolution

This bug has been resolved in ROSA v4.10.21 and higher. See the release notes RHBA-2022:5428 - OpenShift Container Platform 4.10.21 bug fix update, Errata RHBA-2022:5428, and Bugzilla 2091902.

NOTE: Before proceeding with the resolution, refer to the "Diagnostic Steps" section to identify the root cause.

This issue can be related to the metrics APIService (v1beta1.metrics.k8s.io), even though the v1beta1.metrics.k8s.io APIService claims to be available and the underlying prometheus-adapter pods are in the Running state. If the issue is related to the metrics APIService, "post-timeout activity" messages appear in the prometheus-adapter pod logs.

Example:

E0615 13:06:15.024109       1 timeout.go:135] post-timeout activity - time-elapsed: 2.634µs, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
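
To quickly check whether any of the prometheus-adapter pods are logging this error, the logs can be filtered with a label selector. This is a minimal sketch; the app.kubernetes.io/name=prometheus-adapter label is an assumption, so verify it with oc get pods --show-labels or use the pod names directly:

$ oc logs -n openshift-monitoring -l app.kubernetes.io/name=prometheus-adapter --tail=-1 | grep "post-timeout activity"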

Workaround

Note: For OSD and ROSA, the Red Hat SRE team is responsible for applying the workaround until the final fix is released. Please open a new support case.

Restarting the prometheus-adapter pods that show the error messages makes the APIService responsive again. This also helps clear the openshift-build-test-xxx namespaces that are stuck in the Terminating state:

$ oc get pods -n openshift-monitoring | grep prometheus-adapter
prometheus-adapter-xxxxxxxxxx-yyyyyyy            1/1     Running   0          3d9h
prometheus-adapter-yyyyyyyyyy-xxxxxxx            1/1     Running   0          3d9h

$ oc delete pod prometheus-adapter-xxxxxxxxxx-yyyyyyy -n openshift-monitoring
pod "prometheus-adapter-xxxxxxxxxx-yyyyyyy" deleted

If the oc adm top pods command still does not work after the prometheus-adapter pods have been restarted, restart the thanos-querier pods running in the openshift-monitoring namespace and check whether oc adm top pods starts working:

$ oc get pods -n openshift-monitoring | grep thanos-querier
thanos-querier-xxxxxxxxxx-yyyyyyy                5/5     Running   0          4d8h
thanos-querier-yyyyyyyyyy-xxxxxxx                5/5     Running   0          4d8h

$ oc delete pod thanos-querier-yyyyyyyyyy-xxxxxxx -n openshift-monitoring
pod "thanos-querier-yyyyyyyyyy-xxxxxxx" deleted

$ oc delete pod thanos-querier-xxxxxxxxxx-yyyyyyy -n openshift-monitoring
pod "thanos-querier-xxxxxxxxxx-yyyyyyy" deleted

Root Cause

The prometheus-adapter pods were logging post-timeout activity errors, which made the metrics APIService (v1beta1.metrics.k8s.io) unresponsive. As a result, API slowness or ResponseCode: 429 Too many requests errors were observed while executing oc commands.

Diagnostic Steps

(1) Describe a namespace stuck in the Terminating state and check the Reason:

$ oc describe ns openshift-build-test-xxxxxxxx-xxxxx
                                                      
Name:         openshift-build-test-xxxxxxxx-xxxxx
Labels:       kubernetes.io/metadata.name=openshift-build-test-xxxxxxxx-xxxxx
Annotations:  openshift.io/sa.scc.mcs: s0:c64,c34
              openshift.io/sa.scc.supplemental-groups: 1004100000/10000
              openshift.io/sa.scc.uid-range: 1004100000/10000
Status:       Terminating
Conditions:
  Type                                         Status  LastTransitionTime               Reason                  Message
  ----                                         ------  ------------------               ------                  -------
  NamespaceDeletionDiscoveryFailure            True    Mon, 20 Jun 2022 10:17:06 -0400  DiscoveryFailed         Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server has received too many requests and has asked us to try again later.
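
To list every namespace stuck in the Terminating state in a single command, a field selector can be used (a sketch; it assumes the status.phase field selector is supported for namespaces on your cluster version):

$ oc get namespaces --field-selector status.phase=Terminating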

(2) Check the availability of the v1beta1.metrics.k8s.io APIService. Note that the service might claim to be available but still have underlying issues:

$ oc get apiservices v1beta1.metrics.k8s.io
NAME                     SERVICE                                   AVAILABLE   AGE
v1beta1.metrics.k8s.io   openshift-monitoring/prometheus-adapter   True        415d
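
Because the reported availability can be misleading, querying the APIService directly through the API server gives a better signal. If the metrics APIService is healthy, the following call returns the API group resources immediately; an unresponsive backend typically hangs, times out, or returns an error:

$ oc get --raw "/apis/metrics.k8s.io/v1beta1"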

(3) Check the status and logs of the prometheus-adapter pods. Note that the status might show Running, but the pods can still have errors in their logs:

$ oc get pods -n openshift-monitoring -o wide | grep "prometheus-adapter" 
prometheus-adapter-xxxxxxxxxx-yyyyyyy                1/1     Running     0             8d      x.x.x.x     ip-x-x-x-x.ec2.internal    <none>           <none>
prometheus-adapter-yyyyyyyyyy-xxxxxxx                1/1     Running     0             8d      x.x.x.x     ip-x-x-x-x.ec2.internal   <none>           <none> 

The pod logs contain multiple post-timeout activity error entries, as shown below:

$ oc logs -n openshift-monitoring prometheus-adapter-xxxxxxxxxx-yyyyyyy
[...]
E0615 13:06:15.024109       1 timeout.go:135] post-timeout activity - time-elapsed: 2.634µs, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
E0616 02:22:08.215342       1 timeout.go:135] post-timeout activity - time-elapsed: 2.656µs, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
E0616 18:15:15.086088       1 timeout.go:135] post-timeout activity - time-elapsed: 2.683µs, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
E0616 22:36:35.021075       1 timeout.go:135] post-timeout activity - time-elapsed: 2.797µs, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
E0618 03:10:05.075658       1 timeout.go:135] post-timeout activity - time-elapsed: 51.058721ms, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
[...]
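
To get a quick count of how many post-timeout activity entries each prometheus-adapter pod has logged, a small shell loop can be used (a sketch based on the pod names in the openshift-monitoring namespace shown above):

$ for pod in $(oc get pods -n openshift-monitoring -o name | grep prometheus-adapter); do
    echo "${pod}: $(oc logs -n openshift-monitoring ${pod} | grep -c 'post-timeout activity')"
  done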
