Cleaning up after a failed etcd v2 to v3 migration
Environment
OpenShift 3.6 while migrating from etcd2 to etcd3
Issue
If the etcd v2 to v3 migration fails after the v2 keys have been copied to v3 keys, and the cluster is then started with the v2 storage backend, you cannot re-run the migration playbooks to re-migrate while ensuring data consistency.
Resolution
Overview
WARNING: Please contact Red Hat Support for assistance before starting this process.
WARNING: Please ensure that the backups taken during this procedure are preserved.
If a cluster migrated from v2 to v3 and then fell back to using v2 data,
the v3 data must be cleaned up before re-migration. However, due to the
nature of etcd’s data storage, simply deleting all v3 keys and
re-migrating would also lose data. Instead, you must create a new
etcd instance, copy just the v2 data from the old instance to the new
instance, then use that new instance’s member directory to build the
cluster back up to its original member count before re-running the migration process.
This KCS provides a tool called etcd-hard-migrate, which you may use to copy the etcd database. Its source is available on GitHub. The supportability of this tool is undefined.
Copy etcd-hard-migrate binary onto first master
Ensure that the etcd-hard-migrate binary attached to this solution has been downloaded and extracted on your first master.
The binary is attached to this KCS, and the source code is available on GitHub.
Migration Cleanup
Start a second instance
Ensure you have two terminal windows open to the first master. You will
run an etcd instance in one and run other commands in the other.
On the first master, start a second etcd instance running out of
/tmp/etcd-migration. It will run until you press CTRL-C to kill it
after copying the data later in the process.
rm -rf /tmp/etcd-migration/
mkdir /tmp/etcd-migration/
eval $(cat /etc/etcd/etcd.conf |
egrep '(TRUSTED_CA|CERT|KEY)_FILE' |
while read line; do echo export $line; done
)
etcd \
--data-dir /tmp/etcd-migration/ \
--listen-client-urls https://<please use primary master IP>:3000 \
--name tmp-etcd \
--advertise-client-urls https://<please use primary master IP>:3000 \
--listen-peer-urls https://<please use primary master IP>:2999
In your second terminal, verify it’s running
/usr/bin/etcdctl \
--cert-file /etc/etcd/peer.crt \
--key-file /etc/etcd/peer.key \
--ca-file /etc/etcd/ca.crt \
-C https://<please use primary master IP>:3000 cluster-health
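The temporary instance may take a few seconds to come up, so the health check may need to be repeated. The following is a minimal sketch of a retry helper, not part of the original procedure. The function name wait_until and the attempt and delay values are hypothetical, and the sketch assumes the wrapped command exits non-zero until the instance is healthy.

```shell
# Hypothetical helper: re-run a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Discard output; only the exit status matters here.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example (not run here): poll the temporary instance up to 10 times,
# 3 seconds apart, using the same health check as above.
# wait_until 10 3 /usr/bin/etcdctl \
#     --cert-file /etc/etcd/peer.crt \
#     --key-file /etc/etcd/peer.key \
#     --ca-file /etc/etcd/ca.crt \
#     -C https://<please use primary master IP>:3000 cluster-health
```

If the helper returns non-zero, check the first terminal for errors from the temporary etcd instance before continuing.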
Stop Master Services
On all masters, stop master services
systemctl stop atomic-openshift-master-controllers atomic-openshift-master-api
Copy data to the new etcd service
On the first master, run the etcd-hard-migrate tool to copy all v2 keys
to the new cluster
./etcd-hard-migrate \
--src-cacert /etc/etcd/ca.crt \
--src-cert /etc/etcd/peer.crt \
--src-key /etc/etcd/peer.key \
--src-etcd-address https://<please use primary master IP>:2379 \
--dest-cacert /etc/etcd/ca.crt \
--dest-cert /etc/etcd/peer.crt \
--dest-key /etc/etcd/peer.key \
--dest-etcd-address https://<please use primary master IP>:3000
Note
Replace <please use primary master IP> with the actual IP
address of the master.
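To avoid typing the address repeatedly, you could hold it in a shell variable and build the source and destination addresses from it. This is only a sketch: MASTER_IP, SRC_ADDR, DEST_ADDR, and etcd_url are hypothetical names, and 192.168.11.44 is the example first-master address used later in this solution.

```shell
# MASTER_IP is a hypothetical variable; replace the value with the
# actual IP address of your first master.
MASTER_IP=192.168.11.44

# Build an https URL from a host and a port.
etcd_url() {
    printf 'https://%s:%s\n' "$1" "$2"
}

SRC_ADDR=$(etcd_url "$MASTER_IP" 2379)   # the normal etcd service
DEST_ADDR=$(etcd_url "$MASTER_IP" 3000)  # the temporary instance

echo "source:      $SRC_ADDR"
echo "destination: $DEST_ADDR"
```

The two values can then be passed to --src-etcd-address and --dest-etcd-address respectively.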
Stop and cleanup normal etcd services
On all masters, stop the normal etcd services and move their member
directories:
systemctl stop etcd
mv /var/lib/etcd/member /var/lib/etcd/member-pre-cleanup
Stop the second etcd instance by pressing CTRL-C
CTRL-C
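Because the moved member directory is the only copy of the pre-cleanup data on each master, it is worth guarding the move against overwriting an earlier backup. The following is a minimal sketch under that assumption. The function name backup_member_dir is hypothetical, and the function takes the etcd data directory as its only argument.

```shell
# Hypothetical helper: move <data-dir>/member to <data-dir>/member-pre-cleanup,
# refusing to proceed if the source is missing or a backup already exists.
backup_member_dir() {
    data_dir=$1
    src="$data_dir/member"
    dest="$data_dir/member-pre-cleanup"

    if [ ! -d "$src" ]; then
        echo "no member directory at $src" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "$dest already exists; refusing to overwrite it" >&2
        return 1
    fi
    mv "$src" "$dest"
}

# Example (not run here), after etcd has been stopped on the master:
# backup_member_dir /var/lib/etcd
```

If the helper refuses because a backup already exists, rename the earlier backup rather than deleting it.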
Copy the temporary instance’s member data into place
On the first master, copy the new member directory into place, and start
it up forcing a new cluster
cp -r /tmp/etcd-migration/* /var/lib/etcd/
chown -R etcd: /var/lib/etcd
restorecon -Rv /var/lib/etcd
echo ETCD_FORCE_NEW_CLUSTER=true >> /etc/etcd/etcd.conf
systemctl start etcd
sed -i '/ETCD_FORCE_NEW_CLUSTER=true/d' /etc/etcd/etcd.conf
systemctl restart etcd
Verify the first member
Verify that the member information is correct. The key points are that all URLs use https, the correct IP address is used, the peerURLs port is 2380, and the clientURLs port is 2379.
etcdctl2 member list
8e9e05c52164694d: name=ose3-master.example.com peerURLs=http://192.168.11.44:2380 clientURLs=https://192.168.11.44:2379 isLeader=true
If it is not (in the example above, peerURLs uses http), update the member’s peer URL:
etcdctl2 member update 8e9e05c52164694d https://192.168.11.44:2380
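The checks described above can also be applied mechanically to the member list output. The following is a sketch only. The function name check_member_urls is hypothetical, and it assumes each member lists exactly one peer URL and one client URL, as in the examples in this solution.

```shell
# Hypothetical helper: read "member list" output on stdin and report any
# member whose peerURLs is not https on port 2380 or whose clientURLs is
# not https on port 2379. Exits non-zero if any member needs attention.
check_member_urls() {
    awk '
    {
        bad = 0
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^peerURLs=/ && $i !~ /^peerURLs=https:\/\/[^,]+:2380$/) bad = 1
            if ($i ~ /^clientURLs=/ && $i !~ /^clientURLs=https:\/\/[^,]+:2379$/) bad = 1
        }
        if (bad) { print "needs update: " $0; failed = 1 }
    }
    END { exit (failed ? 1 : 0) }
    '
}

# Example (not run here): save the output to a file, then check it.
# etcdctl2 member list > /tmp/members.txt
# check_member_urls < /tmp/members.txt
```

Any line reported as "needs update" identifies a member to correct with etcdctl2 member update.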
Add the 2nd etcd host as a member via the first etcd host
On the second master, find its etcd name and peer URL:
# grep -e ETCD_NAME -e ETCD_LISTEN_PEER_URLS /etc/etcd/etcd.conf
ETCD_NAME=ose3-node1.example.com
ETCD_LISTEN_PEER_URLS=https://192.168.11.22:2380
The important values here are ETCD_NAME and ETCD_LISTEN_PEER_URLS.
You will need both for the next command, where they replace
"ose3-node1.example.com" and "https://192.168.11.22:2380" respectively.
On the primary master:
etcdctl2 member add ose3-node1.example.com https://192.168.11.22:2380
# Added member named ose3-node1.example.com with ID fc718486578c3a61 to cluster
# ETCD_NAME="ose3-node1.example.com"
# ETCD_INITIAL_CLUSTER="ose3-master.example.com=https://192.168.11.44:2380,ose3-node1.example.com=https://192.168.11.22:2380"
# ETCD_INITIAL_CLUSTER_STATE="existing"
Copy the last 3 lines of output to the end of /etc/etcd/etcd.conf on the
2nd etcd host, then start etcd
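To reduce copy and paste mistakes, the three ETCD_ lines can be filtered out of the saved command output. The following is a minimal sketch. The function name extract_member_env and the file paths in the comments are hypothetical. It accepts the lines with or without the leading "# " shown in this solution.

```shell
# Hypothetical helper: read saved "member add" output on stdin and print
# only the ETCD_* assignment lines, dropping an optional leading "# ".
extract_member_env() {
    sed -n 's/^#\{0,1\} *\(ETCD_[A-Z_]*=.*\)$/\1/p'
}

# Example (not run here). Run "member add" only once and keep its output:
# etcdctl2 member add ose3-node1.example.com https://192.168.11.22:2380 \
#     > /tmp/member-add.out
# extract_member_env < /tmp/member-add.out > /tmp/member-env.txt
#
# Then copy /tmp/member-env.txt to the 2nd etcd host and append it there:
# cat /tmp/member-env.txt >> /etc/etcd/etcd.conf
```

Review the filtered lines before appending them, since they determine which cluster the new member joins.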
systemctl start etcd
Verify member list shows two members
etcdctl2 member list
# 8e9e05c52164694d: name=ose3-master.example.com peerURLs=https://192.168.11.44:2380 clientURLs=https://192.168.11.44:2379 isLeader=true
# fc718486578c3a61: name=ose3-node1.example.com peerURLs=https://192.168.11.22:2380 clientURLs=https://192.168.11.22:2379 isLeader=false
Add the 3rd etcd host as a member via the first etcd host
Repeat the previous process for adding the 2nd etcd host as a member,
for the 3rd etcd host.
Verify member list shows all three members
etcdctl2 member list
# 8e9e05c52164694d: name=ose3-master.example.com peerURLs=https://192.168.11.44:2380 clientURLs=https://192.168.11.44:2379 isLeader=true
# fc718486578c3a61: name=ose3-node1.example.com peerURLs=https://192.168.11.22:2380 clientURLs=https://192.168.11.22:2379 isLeader=false
# c29628fad095cd6b: name=ose3-node2.example.com peerURLs=https://192.168.11.33:2380 clientURLs=https://192.168.11.33:2379 isLeader=false
Verify there are no v3 keys
Your etcd cluster is now back to v2 and should have no v3 keys. Verify
this by attempting to list v3 keys. You should get no results.
etcdctl3 get /kubernetes.io --keys-only --prefix
etcdctl3 get /openshift.io --keys-only --prefix
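On a large cluster it can be easier to count the returned keys than to read the output. The following is a sketch only. The function name count_keys is hypothetical, and it simply counts non-empty lines on standard input, on the assumption that each key is printed on its own line.

```shell
# Hypothetical helper: count non-empty lines on stdin.
# grep -c prints 0 and exits non-zero when nothing matches, so the
# exit status is masked to keep the function usable under "set -e".
count_keys() {
    grep -c . || true
}

# Example (not run here): both counts should be 0 after the cleanup.
# etcdctl3 get /kubernetes.io --keys-only --prefix | count_keys
# etcdctl3 get /openshift.io --keys-only --prefix | count_keys
```

A non-zero count means v3 keys are still present and the cluster is not ready for re-migration.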
Restart master services
On all three hosts start master services
systemctl start atomic-openshift-master-api atomic-openshift-master-controllers
Check for snapshots
A snapshot is required for v2 to v3 migration, and by default a snapshot
is only taken after 10,000 writes to the database. A small cluster may
not have reached 10,000 writes. In that case you can force a snapshot by
configuring a low snapshot count, restarting etcd, then removing that
value and restarting again.
Check for a .snap file in /var/lib/etcd/member/snap
ls -la /var/lib/etcd/member/snap/
-rw-r--r--. 1 etcd etcd 37276964 Mar 26 12:07 0000000000000127-000000002ea888c8.snap
-rw-r--r--. 1 etcd etcd 36159743 Mar 26 12:17 0000000000000127-000000002ea8afd9.snap
If there are no .snap files, run the following on each master, one at a time:
echo "ETCD_SNAPSHOT_COUNTER=1000" >> /etc/etcd/etcd.conf
systemctl restart etcd
sleep 30
sed -i '/^ETCD_SNAPSHOT_COUNTER=1000/d' /etc/etcd/etcd.conf
systemctl restart etcd
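The snapshot check can be scripted so that the forced snapshot is only attempted when it is needed. The following is a minimal sketch. The function name has_snapshot is hypothetical, and it only tests whether at least one .snap file exists in the given directory.

```shell
# Hypothetical helper: succeed if the directory contains at least one
# .snap file, fail otherwise.
has_snapshot() {
    snap_dir=$1
    for f in "$snap_dir"/*.snap; do
        # When the glob matches nothing it is left unexpanded, so test
        # that the path really exists.
        if [ -e "$f" ]; then
            return 0
        fi
    done
    return 1
}

# Example (not run here):
# if has_snapshot /var/lib/etcd/member/snap; then
#     echo "snapshot present"
# else
#     echo "no snapshot found; force one as described above"
# fi
```

Run the check again after the second restart to confirm that a .snap file was written.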
You’re now ready to follow standard etcd migration procedures.
Root Cause
Etcd uses MVCC to ensure consistency across all its members. For this reason, whenever a key is deleted in etcd, it is marked as deleted with a higher revision number than it had before.
When etcdctl migrate is executed, it copies only keys that have a higher revision number in etcd2 or that have never been created in etcd3. This is why deleting the v3 keys and re-running the migration against the same instance would lose data.
Diagnostic Steps
This situation can only happen if a user manually deletes etcd3 keys.