Automated daily etcd-backup on OCP 4

So I followed https://docs.openshift.com/container-platform/4.3/backup_and_restore/backing-up-etcd.html and created a script for automating the etcd-backup.

First I created an SSH key pair for a user on a "management host" and then added the public key to the 99-master-ssh MachineConfig.
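
For anyone wanting to do the same, the MachineConfig change looks roughly like this (the key values are placeholders, and the Ignition version must match what your cluster already uses; check with oc get mc 99-master-ssh -o yaml):

oc apply -f - <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-ssh
spec:
  config:
    ignition:
      version: 2.2.0
    passwd:
      users:
      - name: core
        sshAuthorizedKeys:
        - ssh-ed25519 AAAA...installer-key...    # keep the existing key from install time
        - ssh-ed25519 AAAA...backup-user-key...  # new public key for the management host user
EOF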

This script is executed daily as that user from a cron job:

#!/bin/bash
cd /var/backup/openshift || exit 3
ssh core@etcd-0.mydomain.com "sudo /usr/local/bin/etcd-snapshot-backup.sh ./assets/backup/snapshot.db" || exit 1
scp core@etcd-0.mydomain.com:./assets/backup/snapshot.db ./ || exit 2
mv snapshot.db snapshot-$(date +%Y-%m-%d).db
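
For completeness, the corresponding cron entry on the management host could look something like this (user name, script path and schedule are just examples):

#/etc/cron.d/etcd-backup - run the backup script daily at 02:30 as the backup user
30 2 * * * ocpbackup /usr/local/bin/etcd-backup.sh >> /var/log/etcd-backup.log 2>&1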

The "management host" is then backed up by the central backup software.

Any other/better suggestions on how to do this?

Responses

Thanks, looks promising ...

Could you tell me how to back up an entire OpenShift v4.3 cluster?

I added a comment to "Backing up ETCD data with OpenShift Container Platform 4.x". There have been changes in v4.4.3.

etcd-snapshot-backup.sh was probably renamed to cluster-backup.sh at some point. It also backs up the static pod resources in addition to the etcd snapshot; both parts are required for disaster recovery.
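
For reference, on 4.4 and later the script takes the target directory as an argument and, if I remember correctly, leaves two timestamped files there, both of which are needed for a restore:

sudo /usr/local/bin/cluster-backup.sh /home/core/assets/backup

#Expected contents of /home/core/assets/backup afterwards (timestamps will differ):
#  snapshot_<timestamp>.db                   - the etcd snapshot
#  static_kuberesources_<timestamp>.tar.gz   - the static pod resources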

OpenShift 4.6 - Backing up etcd

https://github.com/lgchiaretto/openshift4-backup-automation/

Check my script. You just need a host that can reach the master nodes and a local user with working SSH key access.

#!/bin/bash
#Sascha Gruen - 12.01.2021
#Script runs an etcd-backup on one master-node

LOCAL_BACKUPDIR="/var/backup/openshift"
LOCAL_USER="ocp_mgmt"
REMOTE_BACKUPDIR="/var/home/core/backup/"
REMOTE_USER="core"
#Master-FQDNs
REMOTE_HOST_LIST="master1.cluster master2.cluster master3.cluster"

#Find a reachable master node
for host in $REMOTE_HOST_LIST; do
  sudo su - $LOCAL_USER -c "ssh -o BatchMode=yes -o ConnectTimeout=5 $REMOTE_USER@$host exit"
  if [[ $? -eq 0 ]]; then
    echo "Use node $host"
    REMOTE_HOST=$host
    break;
  else
    echo "ERROR: $host not available"
  fi
done
if [[ $REMOTE_HOST == "" ]]; then echo "ERROR: no host available"; exit 1; fi

#Create backup directory on master node
sudo su - $LOCAL_USER -c "ssh $REMOTE_USER@$REMOTE_HOST 'mkdir -p $REMOTE_BACKUPDIR'" || { echo "ERROR: Could not create backup dir"; exit 2; }

#Execute backup
sudo su - $LOCAL_USER -c "ssh $REMOTE_USER@$REMOTE_HOST 'sudo /usr/local/bin/cluster-backup.sh $REMOTE_BACKUPDIR'" || { echo "ERROR: Backup failed"; exit 3; }

#Change ownership from root to core so the files can be copied and removed over SSH
sudo su - $LOCAL_USER -c "ssh $REMOTE_USER@$REMOTE_HOST 'sudo chown -R $REMOTE_USER $REMOTE_BACKUPDIR'" || { echo "ERROR: Change ownership failed"; exit 4; }

#Copy to local machine and delete on remote host
sudo su - $LOCAL_USER -c "rsync -av --remove-source-files -e ssh $REMOTE_USER@$REMOTE_HOST:$REMOTE_BACKUPDIR $LOCAL_BACKUPDIR" || { echo "ERROR: rsync failed"; exit 5; }

Hello Sascha Gruen,

I have created automation to execute this directly on OCP avoiding SSH connections.

You can see the code here https://github.com/lgchiaretto/openshift4-backup-automation/
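
Roughly speaking, the idea is to trigger the backup through the Kubernetes API instead of SSH, for example like this (node name and target directory are placeholders, and this is only the general idea, not the exact manifests from the repository):

oc debug node/master-0.mydomain.com -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup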

Hi Luiz,

that also seems to be a good solution, but you still have to copy the backup from the node via ssh/scp, or am I missing something? In the worst case (all nodes irrecoverably gone), I want the backup to be stored outside the cluster. You could probably add an external NFS share as a PV to store the backup?
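
Something like this could provide the cluster-external storage for the backup pod to mount (just a sketch; server, path, size and namespace are placeholders):

oc apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: etcd-backup-nfs
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: nfs.example.com
    path: /exports/ocp-etcd-backup
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etcd-backup
  namespace: ocp-etcd-backup
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  volumeName: etcd-backup-nfs
EOF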

kind regards, Sascha

Really nice and clean solution, thanks!

Is there any way to monitor if the job fails?

Sascha Gruen,

Actually, you don't need to copy the backups outside the cluster. You can do it if you'd like, but it's not mandatory. According to the docs https://access.redhat.com/documentation/en-us/openshift_container_platform/4.7/html-single/backup_and_restore/index#disaster-recovery and the KCS https://access.redhat.com/solutions/5599961, you must have SSH access to at least one master node to restore your cluster. That means that if you lose all of your masters, unfortunately, you lose your cluster. If you still have SSH access to even a single master node, you can restore the cluster from the backup. In the automation I create a backup on all masters and keep only the latest one on each; you can change how many to keep, but be careful with the size of the master node partition.
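
For example, a simple retention rule on each master could look like this (directory and number of copies are placeholders):

#Keep only the 3 newest backups in the backup directory on a master node
cd /home/core/assets/backup || exit 1
ls -1t snapshot_*.db | tail -n +4 | xargs -r rm -f
ls -1t static_kuberesources_*.tar.gz | tail -n +4 | xargs -r rm -f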

Hi Luiz,

Oh, that's hard. I didn't know that losing all master machines means losing the whole cluster. I thought I could reuse the manifests (with the same infraID), create new Ignition files, reinstall the cluster and then recover from the etcd backup. Doesn't that work? I read https://access.redhat.com/solutions/5599961, but it only leads me to another approach that isn't discussed there, and that brings me to another question.

Can I gracefully shut down my cluster (https://docs.openshift.com/container-platform/4.7/backup_and_restore/graceful-cluster-shutdown.html#graceful-shutdown-cluster), back up the master machines while they are turned off (e.g. copy the VMDKs to a safe place) and use them later as a starting point for my etcd recovery? At least as long as the backed-up VMDKs and the etcd backup are on the same minor/micro version?

Hello Sascha,

" I didnt know that losing all master machines means that i lose the whole cluster. I thought i could reuse the manifests (with same infraID), create new ignition-files, reinstall the cluster and then recover from etcd-backup. Isnt this working?"

No, that doesn't work, and there is no way to do this.

"Can i gracefully shutdown my cluster (https://docs.openshift.com/container-platform/4.7/backup_and_restore/graceful-cluster-shutdown.html#graceful-shutdown-cluster), backup the master-machines while they turned off (e.g. copy the vmdks to a safe place) and use them later as starting point for my etcd-recovery?"

That will not work either, and even if you find a way to do it, it will definitely not be supported by Red Hat.

As I said, per the KCS https://access.redhat.com/solutions/5599961 there is an RFE (Request for Enhancement) for this, but there is no release date. The only way to recover your cluster today is with at least one master of the current cluster.

Hello Luiz, I am running into an issue while creating the pod, with the error "Error from server (BadRequest): container "container-00" in pod "xyz-debug" is not available".

Hello Moizuddin Aslam,

do you have defaultNodeSelector configured in your cluster?
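
You can check it with, for example:

#Show the cluster Scheduler config; a defaultNodeSelector would appear under spec
oc get scheduler cluster -o yaml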

Hi Luiz, below is the current node selector policy:

spec:
  mastersSchedulable: false
  policy:
    name: ""

Moizuddin Aslam, do you have any certificates to approve? Run the command "oc get csr" to check.
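
For example:

oc get csr
#If any CSRs are listed as Pending, approve them (the CSR name is a placeholder):
oc adm certificate approve <csr-name>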

Dear Luiz, there are no pending certificates to approve. I've done this on other clusters and the result is the same.

Moizuddin Aslam, did you check the size of the /sysroot partition on your master nodes?

I have the same problem as Moizuddin Aslam. I have enough space on /sysroot. Do you have another suggestion on what can be wrong?

Hello Trine Lise Åvik,

I think you are having a problem connecting to your node with the "oc debug node/" command. Try to run oc debug against one master manually (using the command "oc debug node/<master-name>") and check whether that works. If it does not, I really suggest you open a case with Red Hat support to identify the root cause of the problem.
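
For example (the master name is a placeholder):

oc debug node/master-0.mydomain.com
#Then, inside the debug shell:
chroot /host
df -h /sysroot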