Metrics server is failing in OCP 3 with TLS handshake error messages

Solution Verified

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 3.11

Issue

  • The metrics-server-certs secret is missing or contains expired certificates.
  • How to regenerate the metrics-server-certs secret?
  • The following TLS handshake errors are shown in the metrics-server pod logs:

    logs.go:41] http: TLS handshake error from x.x.x.x.1:40XXX: remote error: tls: bad certificate
    
  • Commands oc adm top nodes and oc adm top pods fail with the following errors:

    Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
    
    Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
    
    

Resolution

Check whether the metrics-server-certs secret is missing or its certificate has expired, as shown in the Diagnostic Steps section. If the secret is present and the certificate has not expired, refer to KCS 4492031: Metrics server and HPA not working due to misconfigured MTU size.

If the secret is missing or the certificate has expired, the easiest way to ensure the metrics server deployment has up-to-date certificates is to reinstall it:

  1. Delete the metrics-server certificate secret, if it still exists:

    $ oc delete secret metrics-server-certs -n openshift-metrics-server
    
  2. Uninstall the metrics server:

    $ ansible-playbook -i <path to inventory file> /usr/share/ansible/openshift-ansible/playbooks/metrics-server/config.yml -e openshift_metrics_server_install=false
    
  3. Install the metrics server again:

    $ ansible-playbook -i <path to inventory file> /usr/share/ansible/openshift-ansible/playbooks/metrics-server/config.yml -e openshift_metrics_server_install=true
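
After the reinstall, the metrics API can take a little while to respond again. The following is a minimal sketch of a check, not part of the original procedure: it simply retries the oc adm top nodes command from this article until it succeeds. The function name wait_for_metrics and the default retry count and delay are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch: poll `oc adm top nodes` until the metrics API answers.
# The function name and the defaults (30 attempts, 10 s apart) are
# illustrative assumptions, not values taken from the article.
wait_for_metrics() {
  local attempts=${1:-30} delay=${2:-10} i
  for (( i = 1; i <= attempts; i++ )); do
    if oc adm top nodes >/dev/null 2>&1; then
      echo "metrics API available after $i attempt(s)"
      return 0
    fi
    # Pause before the next attempt, but not after the last one.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  echo "metrics API still unavailable after $attempts attempts" >&2
  return 1
}

# Example: wait_for_metrics 30 10
```

If the function returns a non-zero status, repeat the Diagnostic Steps to see whether the secret was recreated and the new certificate is valid.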
    

Root Cause

A missing or expired metrics-server-certs certificate causes the oc adm top commands and the Horizontal Pod Autoscaler (HPA) to fail.

Diagnostic Steps

Verify the errors shown when running oc adm top nodes or oc adm top pods:

$ oc adm top nodes
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)

$ oc adm top pods
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
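
The ServiceUnavailable errors above come from the aggregated metrics.k8s.io API. As a hedged sketch, the availability of that API can also be read directly from its APIService object. This assumes the metrics server is registered under the standard name v1beta1.metrics.k8s.io, which the article does not state explicitly; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the "Available" condition of the metrics APIService.
# Assumption: the APIService is named v1beta1.metrics.k8s.io.
metrics_apiservice_available() {
  local status
  status=$(oc get apiservice v1beta1.metrics.k8s.io \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' \
    2>/dev/null) || status=""
  echo "v1beta1.metrics.k8s.io Available=${status:-unknown}"
  # Succeed only when the API server reports the service as available.
  [ "$status" = "True" ]
}

# Example: metrics_apiservice_available || echo "metrics API is not being served"
```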

The metrics-server pods look healthy when running oc get pods -n openshift-metrics-server, but their logs contain errors:

$ oc logs metrics-server-xxxxxxx-xxx -n openshift-metrics-server
[...]
I0101 00:00:00.000000       1 logs.go:41] http: TLS handshake error from 10.0.0.1:44196: read tcp 10.0.0.17:8443->10.0.0.1:44196: read: connection timed out
[...]

Confirm whether the metrics-server-certs secret is missing by listing the secrets in the namespace:

$ oc get secrets -n openshift-metrics-server
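
Instead of scanning the list by eye, the secret can be queried by name. This is a small sketch that relies only on the exit status of oc get secret; the function name secret_exists is hypothetical, and its defaults are the secret and namespace used in this article.

```shell
#!/usr/bin/env bash
# Sketch: report whether a secret exists, based on the exit status of
# `oc get secret`. Defaults match the names used in this article.
secret_exists() {
  local name=${1:-metrics-server-certs} ns=${2:-openshift-metrics-server}
  if oc get secret "$name" -n "$ns" >/dev/null 2>&1; then
    echo "secret $ns/$name is present"
  else
    echo "secret $ns/$name is MISSING"
    return 1
  fi
}

# Example: secret_exists
```

Note that a non-zero result can also mean the current user is not allowed to read secrets in that namespace, so run it with sufficient privileges.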

Check whether the metrics-server-certs certificate is still valid by printing its issuer and validity dates:

$ oc get secret metrics-server-certs -n openshift-metrics-server --template='{{index .data "tls.crt"}}' | base64 -d | openssl x509 -noout -issuer -dates
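
The notAfter date printed by the command above has to be compared with the current date. As a minimal sketch, that comparison can be scripted: the hypothetical function cert_expiry_status reads the openssl output on stdin and reports whether the certificate has expired. It assumes GNU date (as shipped on RHEL) to parse the notAfter value.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a certificate has expired from the
# "notAfter=..." line printed by `openssl x509 -noout -dates`.
# Assumption: GNU date is available to parse the openssl date format.
cert_expiry_status() {
  local line not_after expiry_epoch now_epoch
  line=$(grep -m1 '^notAfter=') || { echo "no notAfter line found"; return 2; }
  not_after=${line#notAfter=}
  expiry_epoch=$(date -u -d "$not_after" +%s 2>/dev/null) \
    || { echo "cannot parse date: $not_after"; return 2; }
  now_epoch=$(date -u +%s)
  if (( expiry_epoch <= now_epoch )); then
    echo "EXPIRED on $not_after"
    return 1
  fi
  echo "valid until $not_after ($(( (expiry_epoch - now_epoch) / 86400 )) days left)"
}

# Example (same pipeline as above, with the result piped into the function):
#   oc get secret metrics-server-certs -n openshift-metrics-server \
#     --template='{{index .data "tls.crt"}}' | base64 -d \
#     | openssl x509 -noout -issuer -dates | cert_expiry_status
```

The function returns 0 for a valid certificate, 1 for an expired one, and 2 when the input cannot be parsed, so it can be used in a larger health-check script.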

If the secret is missing, the audit logs configured in the cluster can be analyzed to confirm whether it was accidentally deleted by a user.

