Recovering OKD Cluster After Long Downtime – TLS Certificates Expired
Hello community,
I am facing a challenging situation with my OKD cluster and would appreciate guidance from experienced members.
Server Version: 4.19.0-okd-scos.6
Kubernetes Version: v1.32.5-dirty
Background:
- The cluster was down for an extended period.
- During this time, many TLS certificates (API server, kubelet, kube-controller-manager, console, etc.) expired.
- As a result, the components cannot communicate properly anymore.
What I have attempted:
- I manually regenerated certificates for various components (API server, kubelet, kube-controller-manager).
- I updated the corresponding secrets and static pod certificates.
- I temporarily restored some functionality by manually updating CA bundles in config maps (e.g., kubelet-serving-ca).
- Some components, such as the OpenShift console, still fail to connect due to expired client certificates or CA mismatches.
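Before rotating anything further, it helps to inventory which certificates are actually expired and which are merely close to expiry. A minimal sketch of the check, using a throwaway self-signed certificate as a stand-in for one extracted from a cluster secret (the secret path in the comment is illustrative, not a real name from this cluster):

```shell
# Stand-in for a cert pulled from a cluster secret, e.g. something like
# `oc get secret <name> -n <ns> -o jsonpath='{.data.tls\.crt}' | base64 -d`
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
  -out /tmp/demo.crt -days 1 -subj "/CN=demo" 2>/dev/null

# Print the expiry date of the certificate
openssl x509 -in /tmp/demo.crt -noout -enddate

# -checkend N exits 0 if the cert is still valid N seconds from now, 1 if not.
# Here: will it still be valid in 7 days (604800 s)? A 1-day cert will not be.
if openssl x509 -in /tmp/demo.crt -noout -checkend 604800 >/dev/null; then
  echo "valid for at least 7 more days"
else
  echo "expires within 7 days"
fi
```

Running this loop over every `tls.crt` in the relevant namespaces gives a concrete list of what is expired, instead of rotating by guesswork.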
Challenges:
- It is unclear which certificates must be updated first to restore proper communication.
- Some secrets are automatically recreated by operators, overwriting manual changes.
- Directly accessing logs via kubectl / oc logs is not possible while the API server's certificates are invalid.
- The cluster has a mix of static pods and operator-managed resources, making manual intervention complex.
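While the API server is unreachable, logs can still be read at the node level through the container runtime and the kubelet's journal. A sketch of that workflow, run on a control-plane node (the container ID is a placeholder you would take from the `crictl ps` output):

```shell
# On a control-plane node (e.g. via `ssh core@<node>`), bypass the API entirely:
sudo crictl ps -a                     # list containers, including exited ones
sudo crictl logs <container-id>       # logs of a static pod (kube-apiserver, etc.)
sudo journalctl -u kubelet -b --no-pager | tail -50   # kubelet's own view
```

This is usually the only way to see why the kube-apiserver static pod is crash-looping when certificate problems prevent any `oc` access.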
Goal:
I would like to know the recommended or supported procedure for recovering an OKD cluster that has been down for a long period and now suffers widespread certificate expiration. Specifically:
- Which certificates must be rotated first to reach minimal viable cluster operation?
- How can client and server certificates be regenerated safely without conflicting with the operators?
- What are best practices for updating CA bundles so that all components trust the new certificates?
- How can I keep operators from overwriting manual interventions?
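For reference, the documented OpenShift/OKD 4.x recovery path for expired control-plane certificates relies on the operators regenerating certificates themselves once the control plane can start, with the administrator mainly approving the kubelet CSRs that follow. A sketch, assuming `oc` access has been restored (the secret name in the last command is hypothetical; operator-owned secrets are generally recreated when deleted rather than edited in place):

```shell
# Approve pending kubelet CSRs; repeat until none remain, since new CSRs
# can appear a few minutes after the previous batch is approved:
oc get csr
oc get csr -o name | xargs -r oc adm certificate approve

# Operators own most certificate secrets: rather than patching them,
# delete the secret and let the owning operator mint a fresh one.
# (Hypothetical example name; verify what exists in your cluster first.)
oc delete secret -n openshift-kube-apiserver serving-cert
```

Recent `oc` builds also ship an `oc adm ocp-certificates` subcommand group for explicitly regenerating leaf and signer certificates; check `oc adm ocp-certificates --help` in your 4.19 client before relying on it.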
Any guidance, documentation references, or examples from similar recovery scenarios would be immensely helpful.
Thank you in advance for your support!