Automated daily etcd-backup on OCP 4

So I followed https://docs.openshift.com/container-platform/4.3/backup_and_restore/backing-up-etcd.html and created a script for automating the etcd-backup.

First I created an SSH key pair for a user on a "management host" and then added the public key to the 99-master-ssh MachineConfig.
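For reference, the key setup looks roughly like this (the key path, name and comment are just examples, not the exact values I used):

# Generate a dedicated key pair for the backup user on the management host
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/ocp_etcd_backup -C "ocp-etcd-backup"

# Add the public key to the 99-master-ssh MachineConfig, e.g. with oc edit:
# append the contents of ~/.ssh/ocp_etcd_backup.pub under
# spec.config.passwd.users[0].sshAuthorizedKeys
oc edit machineconfig 99-master-ssh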

This script is executed daily as that user from a cron job:

cd /var/backup/openshift || exit 1
ssh core@etcd-0.mydomain.com "sudo /usr/local/bin/etcd-snapshot-backup.sh ./assets/backup/snapshot.db" || exit 1
scp -r core@etcd-0.mydomain.com:./assets/backup/snapshot.db ./ || exit 2
mv snapshot.db snapshot-$(date +%Y-%m-%d).db
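The corresponding crontab entry on the management host is something like this (the time and script path are just placeholders):

# run the etcd backup script every night at 02:30
30 2 * * * /usr/local/bin/ocp-etcd-backup.sh >> /var/log/ocp-etcd-backup.log 2>&1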

The "management host" is then backed up by the central backup software.

Any other/better suggestions on how to do this?

Responses

Thanks, looks promising ...

Could you tell me how to back up an entire OpenShift v4.3 cluster?

I added a comment to "Backing up ETCD data with OpenShift Container Platform 4.x". There have been changes in v4.4.3.

etcd-snapshot-backup.sh was renamed to cluster-backup.sh at some point. In addition to the etcd snapshot, it also backs up the static Kubernetes resources; both parts are required for disaster recovery.
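For reference, on the newer releases the invocation and output look roughly like this (the timestamps in the file names are generated at run time, so these names are only examples):

sudo /usr/local/bin/cluster-backup.sh /home/core/assets/backup
# writes two files, both needed for a restore, e.g.:
#   /home/core/assets/backup/snapshot_2021-01-12_120000.db
#   /home/core/assets/backup/static_kuberesources_2021-01-12_120000.tar.gz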

OpenShift 4.6 - Backing up etcd

https://github.com/lgchiaretto/openshift4-backup-automation/

Check my script. You just need a host that can reach the master nodes and a local user with working SSH key access.

#!/bin/bash
#Sascha Gruen - 12.01.2021
#Script runs an etcd-backup on one master-node

LOCAL_BACKUPDIR="/var/backup/openshift"
LOCAL_USER="ocp_mgmt"
REMOTE_BACKUPDIR="/var/home/core/backup/"
REMOTE_USER="core"
#Master-FQDNs
REMOTE_HOST_LIST="master1.cluster master2.cluster master3.cluster"

#Search for a functional master node
for host in $REMOTE_HOST_LIST; do
  sudo su - $LOCAL_USER -c "ssh -o BatchMode=yes -o ConnectTimeout=5 $REMOTE_USER@$host exit"
  if [[ $? -eq 0 ]]; then
    echo "Use node $host"
    REMOTE_HOST=$host
    break;
  else
    echo "ERROR: $host not available"
  fi
done
if [[ $REMOTE_HOST == "" ]]; then echo "ERROR: no host available"; exit 1; fi

#Create backup directory on the master node
sudo su - $LOCAL_USER -c "ssh $REMOTE_USER@$REMOTE_HOST 'mkdir -p $REMOTE_BACKUPDIR'" || { echo "ERROR: Could not create backup dir"; exit 2; }

#Execute backup
sudo su - $LOCAL_USER -c "ssh $REMOTE_USER@$REMOTE_HOST 'sudo /usr/local/bin/cluster-backup.sh $REMOTE_BACKUPDIR'" || (echo "ERROR: Backup failed" && exit 3)

#Change ownership from root to core
sudo su - $LOCAL_USER -c "ssh $REMOTE_USER@$REMOTE_HOST 'sudo chown -R $REMOTE_USER $REMOTE_BACKUPDIR'" || (echo "ERROR: Change ownership failed" && exit 4)

#Copy to local machine and delete on remote host
sudo su - $LOCAL_USER -c "rsync -av --remove-source-files -e ssh $REMOTE_USER@$REMOTE_HOST:$REMOTE_BACKUPDIR $LOCAL_BACKUPDIR" || (echo "ERROR: RSYNC failed" && exit 4)

Hello Sascha Gruen,

I have created automation to execute this directly on OCP, avoiding SSH connections.

You can see the code here https://github.com/lgchiaretto/openshift4-backup-automation/
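The general idea of running the backup without SSH can be sketched like this (this is just the manual one-shot equivalent using a debug pod, not the actual contents of the repository):

# run cluster-backup.sh on one control-plane node via a debug pod instead of SSH
MASTER=$(oc get nodes -l node-role.kubernetes.io/master -o jsonpath='{.items[0].metadata.name}')
oc debug node/$MASTER -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup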

Hi Luiz,

That also seems to be a good solution, but you still have to copy the backup from the node via ssh/scp, or am I missing something? In the worst case (all nodes irrecoverably gone), I want to have the backup stored outside the cluster. Perhaps you could add an external NFS share as a PV to store the backup?

Kind regards, Sascha

Sascha Gruen,

Actually, you don't need to copy the backups outside the cluster. You can do it if you'd like, but it's not mandatory. According to the docs https://access.redhat.com/documentation/en-us/openshift_container_platform/4.7/html-single/backup_and_restore/index#disaster-recovery and the KCS https://access.redhat.com/solutions/5599961, you must have SSH access to at least one master node to restore your cluster. That means that if you lose all of your masters, unfortunately, you lose your cluster. If you have SSH access to only one master node, you can restore the cluster from the backup. In the automation, I create a backup on all masters and keep only the latest one on each; you can change how many to keep, but be careful with the size of the master node partition.

Hi Luiz,

Oh, that's hard. I didn't know that losing all master machines means that I lose the whole cluster. I thought I could reuse the manifests (with the same infraID), create new ignition files, reinstall the cluster and then recover from the etcd backup. Isn't this working? I read https://access.redhat.com/solutions/5599961, but it doesn't cover that scenario, which leads me to another question. Can I gracefully shut down my cluster (https://docs.openshift.com/container-platform/4.7/backup_and_restore/graceful-cluster-shutdown.html#graceful-shutdown-cluster), back up the master machines while they are turned off (e.g. copy the VMDKs to a safe place) and use them later as a starting point for my etcd recovery? At least as long as the backed-up VMDKs and the etcd backup are on the same minor/micro version?

Hello Sascha,

" I didnt know that losing all master machines means that i lose the whole cluster. I thought i could reuse the manifests (with same infraID), create new ignition-files, reinstall the cluster and then recover from etcd-backup. Isnt this working?"

No, that does not work, and there is no way to do it.

"Can i gracefully shutdown my cluster (https://docs.openshift.com/container-platform/4.7/backup_and_restore/graceful-cluster-shutdown.html#graceful-shutdown-cluster), backup the master-machines while they turned off (e.g. copy the vmdks to a safe place) and use them later as starting point for my etcd-recovery?"

That will not work either, and even if you find a way to do it, it definitely will not be supported by Red Hat.

As I said, per the KCS https://access.redhat.com/solutions/5599961 there is an RFE (Request for Enhancement) for this, but there is no date for it to be released. The only way to recover your cluster today is with at least one master of the current cluster.