How to restore a cluster from an etcd DB backup when 2 of the 3 master nodes are down?

Solution Verified - Updated -

Issue

  • At least one master node is working and the others are down.
  • Some master nodes keep rebooting.
  • The problem can occur after a hard shutdown or even after a graceful shutdown of the nodes.
  • A leftover container on master-01 and master-03 shuts down each node as soon as the node starts.
  • The container was created to run Command:[chroot /host shutdown -h now], and it now keeps shutting the nodes down on every boot.
  • A pod is stuck and automatically reboots the nodes.

Warning

  • This process should be performed only with Red Hat technical support present.
  • This KCS applies specifically to the case where at least one master node is working, a pod is automatically shutting down the other 2 master nodes, and there is no way to bring them back up.
  • Running Command:[chroot /host shutdown -h now] from a debug container is not recommended, because it can lead to a similar situation.

Logs

It may take multiple reboots and attempts to collect the right journalctl output and discover that a leftover container on master-01 and master-03 was shutting down the node as soon as the node started. Look for Command:[chroot /host shutdown -h now] in the log:

Mar 08 18:39:30 XX001 hyperkube[1655]: E0308 18:39:30.197365    1655 kuberuntime_manager.go:815] container &Container{Name:container-00,Image:registry.ocp.XX:5000/rhel8/support-tools:8.4-13,Command:[chroot /host shutdown -h now],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:...
...
Mar 08 20:46:04 XX001 hyperkube[1694]: I0308 20:46:04.677849    1694 event.go:291] "Event occurred" object="XXX/XX001-debug" kind="Pod" apiVersion="v1" type="Normal" reason="Created" message="Created container container-00"
Mar 08 20:46:04 XX001 hyperkube[1694]: I0308 20:46:04.854458    1694 event.go:291] "Event occurred" object="XXX/XX001-debug" kind="Pod" apiVersion="v1" type="Normal" reason="Started" message="Started container container-00"
Mar 08 20:46:04 XX001 systemd[1]: Stopping libcontainer container 60284018b471c806214dbbc1a2c51f54c0a5ae9ac0334f2e3d89e11f95c18eff.
Mar 08 20:46:04 XX001 systemd[1]: Removed slice machine.slice.
Mar 08 20:46:04 XX001 systemd[1]: machine.slice: Consumed 224ms CPU time

In the situation above, only master-02 was working. One member out of three is not enough for etcd quorum, which requires a majority (2 of 3), because master-01 and master-03 kept shutting down 2-3 minutes after starting.
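
Because the offending entry is easy to miss among ordinary kubelet messages, the search described above can be sketched as a small Python filter. This is only an illustrative sketch, not part of the verified solution: the function name find_shutdown_containers and the regular expression are assumptions modeled on the Container{Name:...,Image:...,Command:[...]} format seen in the excerpt, and other OpenShift versions may log the container spec differently.

```python
import re

# Assumed pattern, modeled on the kubelet "Container{...}" dump in the excerpt.
# Captures the container name, image, and the raw Command:[...] contents.
CONTAINER_RE = re.compile(
    r"Container\{Name:(?P<name>[^,]+),"
    r"Image:(?P<image>[^,]+),"
    r"Command:\[(?P<command>[^\]]*)\]"
)


def find_shutdown_containers(lines):
    """Return (name, image, command) for every logged container
    whose command contains the word 'shutdown'."""
    hits = []
    for line in lines:
        match = CONTAINER_RE.search(line)
        if match and "shutdown" in match.group("command").split():
            hits.append(
                (match.group("name"), match.group("image"), match.group("command"))
            )
    return hits
```

The function accepts any iterable of lines, so it can be fed a journal saved to a file during one of the short windows when the node is up (the file name is up to the operator). Any hit on a debug pod's container is a candidate for the leftover container described in this article.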

Environment

  • Red Hat OpenShift Container Platform 4.x
