Metrics server is failing in OCP 3 with TLS handshake error messages
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 3.11
Issue
- The metrics-server-certs secret is missing or contains expired certificates.
- How to regenerate the metrics-server-certs secret?
- The following TLS handshake errors are shown in the metrics-server pod logs:
  logs.go:41] http: TLS handshake error from x.x.x.x.1:40XXX: remote error: tls: bad certificate
- The commands oc adm top nodes and oc adm top pods fail with the following errors:
  Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
  Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
Resolution
Check whether the metrics-server-certs certificate is expired, as shown in the Diagnostic Steps section. If the certificate is neither expired nor missing, refer to KCS 4492031: Metrics server and HPA not working due to misconfigured MTU size.
If the certificate is expired, the easiest way to ensure the metrics server deployment has up-to-date certificates is to reinstall it:
- Delete the metrics-server certificate secret:
  $ oc delete secret metrics-server-certs -n openshift-metrics-server
- Uninstall the metrics server:
  $ ansible-playbook -i <path to inventory file> /usr/share/ansible/openshift-ansible/playbooks/metrics-server/config.yml -e openshift_metrics_server_install=false
- Install the metrics server again:
  $ ansible-playbook -i <path to inventory file> /usr/share/ansible/openshift-ansible/playbooks/metrics-server/config.yml -e openshift_metrics_server_install=true
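The three steps above could be chained in a small wrapper. The following is only a sketch: the function names run and reinstall_metrics_server and the DRY_RUN variable are illustrative and not part of openshift-ansible. By default it only prints the commands so they can be reviewed first, and the chain stops at the first step that fails.

```shell
# Playbook path taken from the steps above.
PLAYBOOK=/usr/share/ansible/openshift-ansible/playbooks/metrics-server/config.yml

run() {
  # Hypothetical helper: print the command unless DRY_RUN is set to 0.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

reinstall_metrics_server() {
  # $1: path to the Ansible inventory file.
  inventory="$1"
  # Delete the secret, uninstall, then install again; stop on the first failure.
  run oc delete secret metrics-server-certs -n openshift-metrics-server &&
    run ansible-playbook -i "$inventory" "$PLAYBOOK" -e openshift_metrics_server_install=false &&
    run ansible-playbook -i "$inventory" "$PLAYBOOK" -e openshift_metrics_server_install=true
}
```

Calling reinstall_metrics_server with the inventory path prints the three commands; setting DRY_RUN=0 before the call executes them instead.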
Root Cause
A missing or expired metrics-server-certs certificate causes the oc adm top commands and the Horizontal Pod Autoscaler (HPA) to fail.
Diagnostic Steps
Verify the errors shown when running oc adm top nodes or oc adm top pods:
$ oc adm top nodes
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
$ oc adm top pods
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
The metrics-server pods look healthy when running oc get pods -n openshift-metrics-server, but there are errors in the logs:
$ oc logs metrics-server-xxxxxxx-xxx -n openshift-metrics-server
[...]
I0101 00:00:00.000000 1 logs.go:41] http: TLS handshake error from 10.0.0.1:44196: read tcp 10.0.0.17:8443->10.0.0.1:44196: read: connection timed out
[...]
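To get a quick sense of how many handshake failures are in the log, and how many of them are the bad certificate variant listed in the Issue section, the log output can be piped through a filter. This is a minimal sketch; both function names are illustrative.

```shell
count_tls_handshake_errors() {
  # Reads log lines on stdin and prints the number of handshake errors.
  # grep -c exits non-zero when the count is 0, hence "|| true".
  grep -c 'TLS handshake error' || true
}

count_bad_certificate() {
  # Counts only the handshake errors caused by a rejected certificate.
  grep 'TLS handshake error' | grep -c 'tls: bad certificate' || true
}
```

For example: oc logs metrics-server-xxxxxxx-xxx -n openshift-metrics-server | count_bad_certificate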
Confirm whether the metrics-server-certs secret is missing by listing the secrets in the namespace:
$ oc get secrets -n openshift-metrics-server
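Instead of reading the list by eye, the check can be scripted. The sketch below assumes the names are printed one per line with -o name; the function name is illustrative, and the pattern accepts both a secret/ and a secrets/ prefix in case the prefix differs between client versions.

```shell
has_metrics_server_certs() {
  # Reads resource names on stdin, one per line.
  # Succeeds only if metrics-server-certs is among them.
  grep -Eq '^secrets?/metrics-server-certs$'
}
```

For example: oc get secrets -n openshift-metrics-server -o name | has_metrics_server_certs && echo present || echo missing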
Check the metrics-server-certs certificate to see whether it is still valid:
$ oc get secret metrics-server-certs -n openshift-metrics-server --template='{{index .data "tls.crt"}}' | base64 -d | openssl x509 -noout -issuer -dates
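The dates printed by the command above have to be compared with the current time by hand. As an alternative sketch, the -checkend option of openssl x509 can do the comparison and report the result through its exit status. The function name below is illustrative; it reads the decoded PEM certificate on stdin.

```shell
cert_expiry_status() {
  # $1 (optional): number of seconds to look ahead; 0 means "right now".
  seconds_ahead="${1:-0}"
  # openssl exits 0 if the certificate is still valid after that many
  # seconds, and non-zero if it will have expired or cannot be parsed.
  if openssl x509 -noout -checkend "$seconds_ahead" >/dev/null 2>&1; then
    echo valid
  else
    echo expired-or-unreadable
  fi
}
```

For example, replace the final openssl command in the pipeline above with cert_expiry_status, or with cert_expiry_status 2592000 to find out whether the certificate expires within the next 30 days.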
The audit logs configured in the cluster can be analyzed to confirm whether the secret was accidentally deleted by a user.
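One possible way to search those logs is a plain text filter on the secret's API path. This is a rough sketch under assumptions: the location and format of the audit log depend on how auditing was configured in the cluster, the function name is illustrative, and the filter only matches lines that contain both the secret's URI path and the word delete, so the results still need to be reviewed manually.

```shell
find_secret_deletions() {
  # Reads audit log lines on stdin and prints those that mention the
  # metrics-server-certs secret path together with a delete operation.
  grep -F '/namespaces/openshift-metrics-server/secrets/metrics-server-certs' |
    grep -i 'delete' || true
}
```

For example: find_secret_deletions < <path to audit log file>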