Chapter 2. Replacing a failed master host

This document describes the process to replace a single etcd member. This procedure assumes that there is still an etcd quorum in the cluster.

Note

If you have lost the majority of your master hosts, leading to etcd quorum loss, then you must follow the disaster recovery procedure to recover from lost master hosts instead of this procedure.

If the control plane certificates are not valid on the member being replaced, then you must follow the procedure to recover from expired control plane certificates instead of this procedure.

To replace a single master host:

2.1. Removing a failed master host from the etcd cluster

Follow these steps to remove a failed master host from the etcd cluster.

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.
  • You have SSH access to an active master host.

Procedure

  1. View the list of Pods associated with etcd.

    In a terminal that has access to the cluster, run the following command:

    $ oc get pods -n openshift-etcd
    NAME                                                     READY   STATUS    RESTARTS   AGE
    etcd-member-ip-10-0-128-73.us-east-2.compute.internal    2/2     Running   0          15h
    etcd-member-ip-10-0-147-172.us-east-2.compute.internal   2/2     Running   7          122m
    etcd-member-ip-10-0-171-108.us-east-2.compute.internal   2/2     Running   0          15h
  2. Access an active master host.
  3. Run the etcd-member-remove.sh script and pass in the name of the etcd member to remove:

    [core@ip-10-0-128-73 ~]$ sudo -E /usr/local/bin/etcd-member-remove.sh etcd-member-ip-10-0-147-172.us-east-2.compute.internal
    Downloading etcdctl binary..
    etcdctl version: 3.3.10
    API version: 3.3
    etcd client certs already backed up and available ./assets/backup/
    Member 23e4736df4451b32 removed from cluster 6e25bab1bb556673
    etcd member etcd-member-ip-10-0-147-172.us-east-2.compute.internal with 23e4736df4451b32 successfully removed..
  4. Verify that the etcd member has been successfully removed from the cluster:

    1. Connect to the running etcd container:

      [core@ip-10-0-128-73 ~] id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
    2. In the etcd container, export the variables needed for connecting to etcd:

      sh-4.3# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name *peer*crt) ETCDCTL_KEY=$(find /etc/ssl/ -name *peer*key)
    3. In the etcd container, execute etcdctl member list and verify that the removed member is no longer listed:

      sh-4.3#  etcdctl member list -w table
      
      +------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+
      |        ID        | STATUS  |                   NAME                   |                            PEER ADDRS                            |       CLIENT ADDRS        |
      +------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+
      | 29e461db6be4eaaa | started | etcd-member-ip-10-0-128-73.us-east-2.compute.internal | https://etcd-2.clustername.devcluster.openshift.com:2380 | https://10.0.128.73:2379 |
      |  cbe982c74cbb42f | started |  etcd-member-ip-10-0-171-108.us-east-2.compute.internal | https://etcd-1.clustername.devcluster.openshift.com:2380 |   https://10.0.171.108:2379 |
      +------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+

2.2. Adding the member back to the cluster

After you have removed the member from the etcd cluster, use one of the following procedures to add the member to the cluster:

2.2.1. Adding a master host back to the etcd cluster

Follow these steps to add a master host back to the etcd cluster. This procedure assumes that you previously removed the master host from the cluster and that its etcd dependencies, such as TLS certificates and DNS, are valid.

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.
  • You have SSH access to the master host to add to the etcd cluster.
  • You have the IP address of an existing active etcd member.

Procedure

  1. Access the master host to add to the etcd cluster.

    Important

    You must run this procedure on the master host that is being added to the etcd cluster.

  2. Run the etcd-member-add.sh script and pass in two parameters:

    • the IP address of an existing etcd member
    • the name of the etcd member to add
    [core@ip-10-0-147-172 ~]$ sudo -E /usr/local/bin/etcd-member-add.sh \
    10.0.128.73 \ 1
    etcd-member-ip-10-0-147-172.us-east-2.compute.internal 2
    
    Downloading etcdctl binary..
    etcdctl version: 3.3.10
    API version: 3.3
    etcd-member.yaml found in ./assets/backup/
    etcd.conf backup upready exists ./assets/backup/etcd.conf
    Stopping etcd..
    Waiting for etcd-member to stop
    etcd data-dir backup found ./assets/backup/etcd..
    Updating etcd membership..
    Removing etcd data_dir /var/lib/etcd..
    
    ETCD_NAME="etcd-member-ip-10-0-147-172.us-east-2.compute.internal"
    ETCD_INITIAL_CLUSTER="etcd-member-ip-10-0-147-172.us-east-2.compute.internal=https://etcd-1.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-171-108.us-east-2.compute.internal=https://etcd-2.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-128-73.us-east-2.compute.internal=https://etcd-0.clustername.devcluster.openshift.com:2380"
    ETCD_INITIAL_ADVERTISE_PEER_URLS="https://etcd-1.clustername.devcluster.openshift.com:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"'
    Member  1e42c7070decd39 added to cluster 6e25bab1bb556673
    Starting etcd..
    1
    The IP address of an active etcd member. This is not the IP address of the member that you are adding.
    2
    The name of the etcd member to add.
  3. Verify that the etcd member has been successfully added to the etcd cluster:

    1. Connect to the running etcd container:

      [core@ip-10-0-147-172 ~] id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
    2. In the etcd container, export the variables needed for connecting to etcd:

      sh-4.3# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name *peer*crt) ETCDCTL_KEY=$(find /etc/ssl/ -name *peer*key)
    3. In the etcd container, execute etcdctl member list and verify that the new member is listed:

      sh-4.3#  etcdctl member list -w table
      
      +------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+
      |        ID        | STATUS  |                   NAME                   |                            PEER ADDRS                            |       CLIENT ADDRS        |
      +------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+
      | 29e461db6be4eaaa | started | etcd-member-ip-10-0-128-73.us-east-2.compute.internal | https://etcd-2.clustername.devcluster.openshift.com:2380 | https://10.0.128.73:2379 |
      |  cbe982c74cbb42f | started | etcd-member-ip-10-0-147-172.us-east-2.compute.internal | https://etcd-0.clustername.devcluster.openshift.com:2380 | https://10.0.147.172:2379 |
      | a752f80bcb0da3e8 | started |   etcd-member-ip-10-0-171-108.us-east-2.compute.internal | https://etcd-1.clustername.devcluster.openshift.com:2380 |   https://10.0.171.108:2379 |
      +------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+

      It may take up to 10 minutes for the new member to start.

    4. In the etcd container, execute etcdctl endpoint health and verify that the new member is healthy:

      sh-4.3# etcdctl endpoint health --cluster
      https://10.0.128.73:2379 is healthy: successfully committed proposal: took = 4.5576ms
      https://10.0.147.172:2379 is healthy: successfully committed proposal: took = 5.1521ms
      https://10.0.171.108:2379 is healthy: successfully committed proposal: took = 4.2631ms
  4. Verify that the new member is in the list of Pods associated with etcd and that its status is Running.

    In a terminal that has access to the cluster, run the following command:

    $ oc get pods -n openshift-etcd
    NAME                                                     READY   STATUS    RESTARTS   AGE
    etcd-member-ip-10-0-128-73.us-east-2.compute.internal    2/2     Running   0          15h
    etcd-member-ip-10-0-147-172.us-east-2.compute.internal   2/2     Running   7          122m
    etcd-member-ip-10-0-171-108.us-east-2.compute.internal   2/2     Running   0          15h

2.2.2. Generating etcd certificates and adding the member to the cluster

If the node is new or the etcd certificates on the node are no longer valid, you must generate the etcd certificates before you can add the member to the etcd cluster.

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.
  • You have SSH access to the new master host to add to the etcd cluster.
  • You have SSH access to the one of the healthy master hosts.
  • You have the IP address of one of the healthy master hosts.

Procedure

  1. Set up a temporary etcd certificate signer service on one of the healthy master nodes.

    1. Access one of the healthy master nodes and log in to your cluster as a cluster-admin user using the following command.

      [core@ip-10-0-143-125 ~]$ sudo oc login https://localhost:6443
      Authentication required for https://localhost:6443 (openshift)
      Username: kubeadmin
      Password:
      Login successful.
    2. Obtain the pull specification for the kube-etcd-signer-server image.

      [core@ip-10-0-143-125 ~]$ export KUBE_ETCD_SIGNER_SERVER=$(sudo oc adm release info --image-for kube-etcd-signer-server --registry-config=/var/lib/kubelet/config.json)
    3. Run the tokenize-signer.sh script.

      Be sure to pass in the -E flag to sudo so that environment variables are properly passed to the script.

      [core@ip-10-0-143-125 ~]$ sudo -E /usr/local/bin/tokenize-signer.sh ip-10-0-143-125 1
      Populating template /usr/local/share/openshift-recovery/template/kube-etcd-cert-signer.yaml.template
      Populating template ./assets/tmp/kube-etcd-cert-signer.yaml.stage1
      Tokenized template now ready: ./assets/manifests/kube-etcd-cert-signer.yaml
      1
      The host name of the healthy master, where the signer should be deployed.
    4. Create the signer Pod using the file that was generated.

      [core@ip-10-0-143-125 ~]$ sudo oc create -f assets/manifests/kube-etcd-cert-signer.yaml
      pod/etcd-signer created
    5. Verify that the signer is listening on this master node.

      [core@ip-10-0-143-125 ~]$ ss -ltn | grep 9943
      LISTEN   0         128                       *:9943                   *:*
  2. Add the new master host to the etcd cluster.

    1. Access the new master host to be added to the cluster, and log in to your cluster as a cluster-admin user using the following command.

      [core@ip-10-0-156-255 ~]$ sudo oc login https://localhost:6443
      Authentication required for https://localhost:6443 (openshift)
      Username: kubeadmin
      Password:
      Login successful.
    2. Export two environment variables that are required by the etcd-member-recover.sh script.

      [core@ip-10-0-156-255 ~]$ export SETUP_ETCD_ENVIRONMENT=$(sudo oc adm release info --image-for machine-config-operator --registry-config=/var/lib/kubelet/config.json)
      [core@ip-10-0-156-255 ~]$ export KUBE_CLIENT_AGENT=$(sudo oc adm release info --image-for kube-client-agent --registry-config=/var/lib/kubelet/config.json)
    3. Run the etcd-member-recover.sh script.

      Be sure to pass in the -E flag to sudo so that environment variables are properly passed to the script.

      [core@ip-10-0-156-255 ~]$ sudo -E /usr/local/bin/etcd-member-recover.sh 10.0.143.125 etcd-member-ip-10-0-156-255.ec2.internal 1
      Downloading etcdctl binary..
      etcdctl version: 3.3.10
      API version: 3.3
      etcd-member.yaml found in ./assets/backup/
      etcd.conf backup upready exists ./assets/backup/etcd.conf
      Trying to backup etcd client certs..
      etcd client certs already backed up and available ./assets/backup/
      Stopping etcd..
      Waiting for etcd-member to stop
      etcd data-dir backup found ./assets/backup/etcd..
      etcd TLS certificate backups found in ./assets/backup..
      Removing etcd certs..
      Populating template /usr/local/share/openshift-recovery/template/etcd-generate-certs.yaml.template
      Populating template ./assets/tmp/etcd-generate-certs.stage1
      Populating template ./assets/tmp/etcd-generate-certs.stage2
      Starting etcd client cert recovery agent..
      Waiting for certs to generate..
      Waiting for certs to generate..
      Waiting for certs to generate..
      Waiting for certs to generate..
      Stopping cert recover..
      Waiting for generate-certs to stop
      Patching etcd-member manifest..
      Updating etcd membership..
      Member 249a4b9a790b3719 added to cluster 807ae3bffc8d69ca
      
      ETCD_NAME="etcd-member-ip-10-0-156-255.ec2.internal"
      ETCD_INITIAL_CLUSTER="etcd-member-ip-10-0-143-125.ec2.internal=https://etcd-0.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-156-255.ec2.internal=https://etcd-1.clustername.devcluster.openshift.com:2380"
      ETCD_INITIAL_ADVERTISE_PEER_URLS="https://etcd-1.clustername.devcluster.openshift.com:2380"
      ETCD_INITIAL_CLUSTER_STATE="existing"
      Starting etcd..
      1
      Specify both the IP address of the healthy master where the signer server is running, and the etcd name of the new member.
    4. Verify that the new master host has been added to the etcd member list.

      1. Access the healthy master and connect to the running etcd container.

        [core@ip-10-0-143-125 ~] id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
      2. In the etcd container, export variables needed for connecting to etcd.

        sh-4.3# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name *peer*crt) ETCDCTL_KEY=$(find /etc/ssl/ -name *peer*key)
      3. In the etcd container, execute etcdctl member list and verify that the new member is listed.

        sh-4.3#  etcdctl member list -w table
        
        +------------------+---------+------------------------------------------+----------------------------------------------------------------+---------------------------+
        |        ID        | STATUS  |                   NAME                   |                           PEER ADDRS                           |       CLIENT ADDRS        |
        +------------------+---------+------------------------------------------+----------------------------------------------------------------+---------------------------+
        |  cbe982c74cbb42f | started |  etcd-member-ip-10-0-156-255.ec2.internal | https://etcd-0.clustername.devcluster.openshift.com:2380 |  https://10.0.156.255:2379 |
        | 249a4b9a790b3719 | started | etcd-member-ip-10-0-143-125.ec2.internal | https://etcd-1.clustername.devcluster.openshift.com:2380 | https://10.0.143.125:2379 |
        +------------------+---------+------------------------------------------+----------------------------------------------------------------+---------------------------+

        It may take up to 20 minutes for the new member to start.

  3. After the new member is added, remove the signer Pod because it is no longer needed.

    In a terminal that has access to the cluster, run the following command:

    $ oc delete pod -n openshift-config etcd-signer