  • Recovering OKD Cluster After Long Downtime – TLS Certificates Expired

    Hello community,

    I am facing a challenging situation with my OKD cluster and would appreciate guidance from experienced members.

    Server Version: 4.19.0-okd-scos.6
    Kubernetes Version: v1.32.5-dirty

    Background:

    • The cluster was down for an extended period.
    • During this time, many TLS certificates (API server, kubelet, kube-controller-manager, console, etc.) expired.
    • As a result, the components cannot communicate properly anymore.
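For triage, it can help to list which certificates have actually expired before changing anything. The following is a minimal sketch, not a recovery procedure. It assumes the end dates were collected beforehand by running `openssl x509 -enddate -noout -in <cert>` against each certificate file, which prints a line of the form `notAfter=...`. The certificate names in the sample and the helper names `parse_not_after` and `expired_certs` are hypothetical and only illustrate the idea.

```python
import ssl
from datetime import datetime, timezone


def parse_not_after(line: str) -> datetime:
    """Parse an openssl line such as 'notAfter=Jan  5 09:34:43 2018 GMT'."""
    value = line.strip().split("=", 1)[1]
    # ssl.cert_time_to_seconds understands the OpenSSL date format (GMT only).
    seconds = ssl.cert_time_to_seconds(value)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def expired_certs(end_dates: dict[str, str], now: datetime) -> list[str]:
    """Return the sorted names of certificates whose notAfter is at or before 'now'."""
    return sorted(
        name for name, line in end_dates.items() if parse_not_after(line) <= now
    )


if __name__ == "__main__":
    # Hypothetical sample data; real values would come from the openssl output.
    sample = {
        "api-server": "notAfter=Jan  5 09:34:43 2018 GMT",
        "kubelet": "notAfter=Jan  5 09:34:43 2030 GMT",
    }
    print(expired_certs(sample, datetime.now(timezone.utc)))
```

Passing `now` explicitly keeps the check reproducible, for example when comparing against the time the cluster was shut down rather than the current time.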

    What I have attempted:

    • I manually regenerated certificates for various components (API server, kubelet, kube-controller-manager).
    • I updated the corresponding secrets and static pod certificates.
    • I temporarily restored some functionality by manually updating CA bundles in config maps (e.g., kubelet-serving-ca) and restarting pods.
    • Some components, like the OpenShift console, still fail to connect due to expired client certificates or CA mismatches.

    Challenges:

    • It is unclear which certificates must be updated first to restore proper communication.
    • Some secrets are automatically recreated by operators, overwriting manual changes.
    • Directly accessing logs via kubectl or oc logs often fails due to certificate or authorization errors.
    • The cluster has a mix of static pods and operator-managed resources, making manual intervention complex.

    Goal:
    I would like to know the recommended or supported procedure for recovering an OKD cluster that has been down for a long period and now suffers widespread certificate expiration. Specifically:

    1. Which certificates must be rotated first to reach a minimal viable cluster?
    2. How can client and server certificates be regenerated safely, without causing further conflicts with operators?
    3. What are the best practices for updating CA bundles so that all components trust the new certificates?
    4. How can manual interventions be kept from being overwritten by operators?

    Any guidance, documentation references, or examples from similar recovery scenarios would be immensely helpful.

    Thank you in advance for your support!
