How to restore a cluster from an etcd DB backup when 2 of the 3 master nodes are down?
Issue
- At least one master node is working and the others are down.
- Some master nodes keep rebooting.
- This can occur after a hard shutdown or even after a graceful shutdown of the nodes.
- A leftover container on master-01 and master-03 was shutting down each node as soon as the node started.
- The container was configured to run Command:[chroot /host shutdown -h now], so it keeps shutting the nodes down.
- A pod is stuck and automatically reboots the nodes.
Warning
- This procedure should be performed only with the assistance of Red Hat technical support.
- This KCS applies specifically to the case where at least one master node is working, a pod is automatically shutting down the other 2 master nodes, and there is no way to turn them on.
- Running Command:[chroot /host shutdown -h now] from a debug container is not recommended because it can lead to a similar situation.
Logs
It may take multiple reboots and attempts to collect the right journalctl output and discover that a leftover container on master-01 and master-03 was shutting down the node as soon as the node started. See Command:[chroot /host shutdown -h now] in the following log:
Mar 08 18:39:30 XX001 hyperkube[1655]: E0308 18:39:30.197365 1655 kuberuntime_manager.go:815] container &Container{Name:container-00,Image:registry.ocp.XX:5000/rhel8/support-tools:8.4-13,Command:[chroot /host shutdown -h now],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:...
...
Mar 08 20:46:04 XX001 hyperkube[1694]: I0308 20:46:04.677849 1694 event.go:291] "Event occurred" object="XXX/XX001-debug" kind="Pod" apiVersion="v1" type="Normal" reason="Created" message="Created container container-00"
Mar 08 20:46:04 XX001 hyperkube[1694]: I0308 20:46:04.854458 1694 event.go:291] "Event occurred" object="XXX/XX001-debug" kind="Pod" apiVersion="v1" type="Normal" reason="Started" message="Started container container-00"
Mar 08 20:46:04 XX001 systemd[1]: Stopping libcontainer container 60284018b471c806214dbbc1a2c51f54c0a5ae9ac0334f2e3d89e11f95c18eff.
Mar 08 20:46:04 XX001 systemd[1]: Removed slice machine.slice.
Mar 08 20:46:04 XX001 systemd[1]: machine.slice: Consumed 224ms CPU time
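Finding the offending container in a long journal can be tedious. The following is a minimal sketch, not an official procedure: find_shutdown_containers is a hypothetical helper name, and the pattern assumes kubelet prints the container spec in the same Name/Image/Command layout as the log lines above.

```shell
# Hypothetical helper: read journal text on stdin and print the name, image
# and command of every container whose Command contains "shutdown".
# Assumes the "Name:...,Image:...,Command:[...]" layout seen in the log above.
find_shutdown_containers() {
    # -o prints only the matched part; -E enables extended regular expressions.
    grep -oE 'Name:[^,]+,Image:[^,]+,Command:\[[^]]*shutdown[^]]*\]' | sort -u
}
```

On an affected node, it could be fed with the journal of the previous boot, for example journalctl -b -1 --no-pager | find_shutdown_containers, assuming the journal from that boot was retained.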
In the situation above, only master-02 was working, which was not enough for etcd quorum because master-01 and master-03 were shutting down after 2-3 minutes.
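The quorum arithmetic behind this can be sketched as follows. etcd needs a majority of its members, floor(N/2) + 1, to be available. The function names etcd_quorum and has_quorum are hypothetical and used only for illustration.

```shell
# Quorum for an etcd cluster of N members is floor(N/2) + 1.
etcd_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL HEALTHY
# Succeeds (exit status 0) only if HEALTHY members are enough for quorum.
has_quorum() {
    [ "$2" -ge "$(etcd_quorum "$1")" ]
}
```

With 3 members the quorum is 2, so has_quorum 3 1 fails, which matches the case where only master-02 is up.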
Environment
- Red Hat OpenShift Container Platform 4.x