Openshift APIService Slow: ResponseCode: 429 Too many requests
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4
- Red Hat OpenShift Service on AWS 4 (ROSA)
- 4
- Red Hat OpenShift Dedicated 4 (OSD)
- 4
- Azure Red Hat OpenShift 4 (ARO)
- 4
Issue
- Openshift CLI commands (
oc
) are taking longer time than expected to show the output. -
There are
ResponseCode: 429
Too many requests
errors in thekube-apiserver
log:controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 429, Body: Too many requests, please try again later.
-
The following errors are shown in the
prometheus-adapter
pods:timeout.go:135] post-timeout activity - time-elapsed
- Multiple namespaces starting with
openshift-build-test-
are stuck interminating
state. -
oc adm top pod
command fails with the following timeout error.Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods.metrics.k8s.io)
Resolution
This bug has been resolved in ROSA v4.10.21 or higher via release notes RHBA-2022:5428 - OpenShift Container Platform 4.10.21 bug fix update, Errata RHBA-2022:5428 and Bugzilla 2091902
NOTE: Before proceeding with the resolution, please refer the "Diagnostic Steps" section to identify the root cause.
This issue can be related to the metrics APIservice
(v1beta1.metrics.k8s.io
). However, the v1beta1.metrics.k8s.io
APIService
claims to be available and the underlying prometheus-adapter
pods are in the running
state. If the issue is related to the metrics APIservice
, you will observe "post-timeout activity
" logs in the prometheus-adapter
pods.
Example:
E0615 13:06:15.024109 1 timeout.go:135] post-timeout activity - time-elapsed: 2.634µs, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
Workaround
Note: For OSD and ROSA, the Red Hat SRE Team is responsible to apply a workaround while the final fix is released. Please open a new support case.
Restarting the prometheus-adapter
pods with error messages makes the APIservice
responsive again. This will also help in clearing the openshift-build-test-xxx
namespaces stuck in terminating
state:
$ oc get pods -n openshift-monitoring | grep prometheus-adapter
prometheus-adapter-xxxxxxxxxx-yyyyyyy 1/1 Running 0 3d9h
prometheus-adapter-yyyyyyyyyy-xxxxxxx 1/1 Running 0 3d9h
$ oc delete pod prometheus-adapter-xxxxxxxxxx-yyyyyyy -n openshift-monitoring
pod "prometheus-adapter-xxxxxxxxxx-yyyyyyy" deleted
If oc adm top pods
command does not work after prometheus-adapter
pods restart, then restart the thanos-querier
pods running under openshift-monitoring
namespace and check if oc adm top pods
starts working:
$ oc get pods -n openshift-monitoring | grep thanos-querier
thanos-querier-xxxxxxxxxx-yyyyyyy 5/5 Running 0 4d8h
thanos-querier-yyyyyyyyyy-xxxxxxx 5/5 Running 0 4d8h
$ oc delete pod thanos-querier-yyyyyyyyyy-xxxxxxx -n openshift-monitoring
pod "thanos-querier-yyyyyyyyyy-xxxxxxx" deleted
$ oc delete pod thanos-querier-xxxxxxxxxx-yyyyyyy -n openshift-monitoring
pod "thanos-querier-xxxxxxxxxx-yyyyyyy" deleted
Root Cause
The post-timeout activity
error logs were observed in the prometheus-adapter
pods due to which the metrics APIservice
(v1beta1.metrics.k8s.io
) was unresponsive and as a result, an API slowness or ResponseCode: 429
Too many requests
was observed while executing oc
commands.
Diagnostic Steps
(1) Describe the namespace stuck in terminating
state and check for the Reason
:
$ oc describe ns openshift-build-test-xxxxxxxx-xxxxx
Name: openshift-build-test-xxxxxxxx-xxxxx
Labels: kubernetes.io/metadata.name=openshift-build-test-xxxxxxxx-xxxxx
Annotations: openshift.io/sa.scc.mcs: s0:c64,c34
openshift.io/sa.scc.supplemental-groups: 1004100000/10000
openshift.io/sa.scc.uid-range: 1004100000/10000
Status: Terminating
Conditions:
Type Status LastTransitionTime Reason Message
---- ------ ------------------ ------ -------
NamespaceDeletionDiscoveryFailure True Mon, 20 Jun 2022 10:17:06 -0400 DiscoveryFailed Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server has received too many requests and has asked us to try again later.
(2) Check the APIservice
v1beta1.metrics.k8s.io
for its availability. Please note that the service might claims to be available but still has underlying issues:
$ oc get apiservices v1beta1.metrics.k8s.io
NAME SERVICE AVAILABLE AGE
v1beta1.metrics.k8s.io openshift-monitoring/prometheus-adapter True 415d
(3) Check the status and logs of prometheus-adapter
pods. Please note that the status might seems running
but the pods have errors in the logs:
$ oc get pods -n openshift-monitoring -o wide | grep "prometheus-adapter"
prometheus-adapter-xxxxxxxxxx-yyyyyyy 1/1 Running 0 8d x.x.x.x ip-x-x-x-x.ec2.internal <none> <none>
prometheus-adapter-yyyyyyyyyy-xxxxxxx 1/1 Running 0 8d x.x.x.x ip-x-x-x-x.ec2.internal <none> <none>
Pod logs have multiple entries of post-timeout activity
error as shown below:
$ oc logs -n openshift-monitoring prometheus-adapter-xxxxxxxxxx-yyyyyyy
[...]
E0615 13:06:15.024109 1 timeout.go:135] post-timeout activity - time-elapsed: 2.634µs, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
E0616 02:22:08.215342 1 timeout.go:135] post-timeout activity - time-elapsed: 2.656µs, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
E0616 18:15:15.086088 1 timeout.go:135] post-timeout activity - time-elapsed: 2.683µs, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
E0616 22:36:35.021075 1 timeout.go:135] post-timeout activity - time-elapsed: 2.797µs, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
E0618 03:10:05.075658 1 timeout.go:135] post-timeout activity - time-elapsed: 51.058721ms, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
[...]
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
Comments