How to restore cluster from etcd DB backup when 2 out of the 3 Master nodes are down?

Solution Verified - Updated 2022-08-08T19:02:01+00:00

Issue

At least one master node is working and the others are down.
Some master nodes keep rebooting.
This can occur after a hard shutdown or even after a graceful shutdown of the nodes.
A leftover container on master-01 and master-03 was shutting down the node as soon as the node started.
The container was configured to run Command:[chroot /host shutdown -h now], so it keeps shutting the nodes down.
A pod is stuck and automatically reboots the nodes.

Warning

This process should be performed only in the presence of Red Hat technical support.
This KCS applies specifically to the case where at least one master node is working, a pod is automatically shutting down the other 2 master nodes, and there is no way to turn them on.
Running Command:[chroot /host shutdown -h now] from a debug container is not recommended because it can lead to a similar situation.

Logs

It may take multiple reboots and attempts to collect the right journalctl output and discover that a leftover container on master-01 and master-03 was shutting down the node as soon as the node started. Look for Command:[chroot /host shutdown -h now] in the container spec:

Mar 08 18:39:30 XX001 hyperkube[1655]: E0308 18:39:30.197365    1655 kuberuntime_manager.go:815] container &Container{Name:container-00,Image:registry.ocp.XX:5000/rhel8/support-tools:8.4-13,Command:[chroot /host shutdown -h now],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:...
...
Mar 08 20:46:04 XX001 hyperkube[1694]: I0308 20:46:04.677849    1694 event.go:291] "Event occurred" object="XXX/XX001-debug" kind="Pod" apiVersion="v1" type="Normal" reason="Created" message="Created container container-00"
Mar 08 20:46:04 XX001 hyperkube[1694]: I0308 20:46:04.854458    1694 event.go:291] "Event occurred" object="XXX/XX001-debug" kind="Pod" apiVersion="v1" type="Normal" reason="Started" message="Started container container-00"
Mar 08 20:46:04 XX001 systemd[1]: Stopping libcontainer container 60284018b471c806214dbbc1a2c51f54c0a5ae9ac0334f2e3d89e11f95c18eff.
Mar 08 20:46:04 XX001 systemd[1]: Removed slice machine.slice.
Mar 08 20:46:04 XX001 systemd[1]: machine.slice: Consumed 224ms CPU time
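Finding the offending entry by eye in a large journal can be slow. The following is a minimal sketch, not part of the original procedure, of how saved journalctl output could be filtered for container specs whose command invokes shutdown. The function name find_shutdown_containers, the regular expression, and the shortened sample line are illustrative assumptions based only on the log format shown above.

```python
import re

# Assumed pattern, derived from the "&Container{Name:...,Image:...,Command:[...]"
# fragment visible in the log excerpt above. Other releases may log differently.
SPEC_RE = re.compile(
    r"&Container\{Name:(?P<name>[^,]+),"
    r"Image:(?P<image>[^,]+),"
    r"Command:\[(?P<command>[^\]]*)\]"
)


def find_shutdown_containers(lines):
    """Return (name, image, command) for every logged container spec
    whose command contains the word 'shutdown'."""
    hits = []
    for line in lines:
        match = SPEC_RE.search(line)
        if match and "shutdown" in match.group("command").split():
            hits.append(
                (match.group("name"), match.group("image"), match.group("command"))
            )
    return hits


# Shortened copy of the first log line from this article, plus an unrelated line.
SAMPLE = [
    "Mar 08 18:39:30 XX001 hyperkube[1655]: E0308 18:39:30.197365    1655 "
    "kuberuntime_manager.go:815] container &Container{Name:container-00,"
    "Image:registry.ocp.XX:5000/rhel8/support-tools:8.4-13,"
    "Command:[chroot /host shutdown -h now],Args:[],WorkingDir:,",
    "Mar 08 20:46:04 XX001 systemd[1]: Removed slice machine.slice.",
]

for name, image, command in find_shutdown_containers(SAMPLE):
    print(f"{name} ({image}) runs: {command}")
```

To use it on a real node, pass the lines of a saved journal file in place of SAMPLE, for example with open(path) as the iterable.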

In this situation, only master-02 was working, which was not enough for etcd quorum because master-01 and master-03 were shutting down after 2-3 minutes.
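The quorum statement follows from the majority rule that etcd uses: a cluster of N members needs floor(N/2) + 1 healthy members to accept writes. The sketch below only illustrates that arithmetic. The function names quorum_size and has_quorum are hypothetical and are not part of any etcd or OpenShift tooling.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def has_quorum(healthy: int, members: int) -> bool:
    """True when the healthy members can still form a majority."""
    return healthy >= quorum_size(members)


# Three master nodes, with only master-02 staying up.
print(quorum_size(3))
print(has_quorum(healthy=1, members=3))
```

With three masters the majority is two, so a single surviving member cannot serve the cluster on its own, which is why a restore from backup is needed in this scenario.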

Environment

Red Hat OpenShift Container Platform 4.x
