Error during cluster upgrade in task "etcd : Generate etcd backup"
Issue
During the upgrade of the control plane in a Red Hat OpenShift Container Platform cluster, the playbook fails with the following error:
# ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_control_plane.yml
...
TASK [etcd : Generate etcd backup] **********************************************************************************************************************************************************
Monday 02 September 2019 11:45:57 +0200 (0:00:00.402) 0:28:19.054 ******
fatal: [master-1.ocpcarc.local]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "backup", "--data-dir=/var/lib/etcd/", "--backup-dir=/var/lib/etcd//openshift-backup-post-3.0-20190902114554"], "delta": "0:00:03.116419", "end": "2019-09-02 11:46:00.941363", "failed": true, "msg": "non-zero return code", "rc": 141, "start": "2019-09-02 11:45:57.824944", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
fatal: [master-0.ocpcarc.local]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "backup", "--data-dir=/var/lib/etcd/", "--backup-dir=/var/lib/etcd//openshift-backup-post-3.0-20190902114554"], "delta": "0:00:03.107044", "end": "2019-09-02 11:46:00.957878", "failed": true, "msg": "non-zero return code", "rc": 141, "start": "2019-09-02 11:45:57.850834", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
fatal: [master-2.ocpcarc.local]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "backup", "--data-dir=/var/lib/etcd/", "--backup-dir=/var/lib/etcd//openshift-backup-post-3.0-20190902114554"], "delta": "0:00:03.902509", "end": "2019-09-02 11:46:01.718962", "failed": true, "msg": "non-zero return code", "rc": 141, "start": "2019-09-02 11:45:57.816453", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
The etcd cluster has been checked and is healthy.
Running the same backup command that the playbook uses on one of the masters terminates with a non-zero exit code (the directory passed to --backup-dir must not exist or must be empty):
master-0# /usr/local/bin/master-exec etcd etcd etcdctl backup --data-dir=/var/lib/etcd --backup-dir=/var/lib/etcd/openshift-backup-post-3.0-201909271244
command terminated with exit code 141
Exit code 141 conventionally indicates termination by SIGPIPE (128 + 13), that is, a broken pipe, but the backup actually completes successfully and is stored in the specified directory.
Running the backup from inside one of the etcd pods completes successfully and without errors, but it prints an unexpected log message, which is what causes exit code 141 in the first command:
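The relationship between the exit status and the signal can be sketched in shell. Shells report a process killed by signal N as exit status 128 + N, and SIGPIPE is signal 13 on Linux. The helper name signal_from_rc is hypothetical and used only for illustration; it is not part of openshift-ansible or etcd.

```shell
# Hypothetical helper: map an exit status to the signal number behind it.
# Statuses above 128 conventionally mean "terminated by signal (status - 128)".
signal_from_rc() {
  rc=$1
  if [ "$rc" -gt 128 ]; then
    echo $((rc - 128))
  else
    # 0 here means "not a signal-style exit status"
    echo 0
  fi
}

sig=$(signal_from_rc 141)
echo "exit status 141 -> signal $sig"
```

On Linux, signal 13 is SIGPIPE, which is raised when a process writes to a pipe whose reading end has been closed.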
# oc exec -ti master-etcd-master-0.example.local -n kube-system -- /bin/sh
sh-4.2# etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://172.16.10.12:2379 backup --data-dir /var/lib/etcd/ --backup-dir /var/lib/etcd/openshift-backup-post-3.0-20190902114554
2019-09-02 16:36:13.283833 I | wal: segmented wal file /var/lib/etcd/openshift-backup-post-3.0-20190902114554/member/wal/0000000000000001-000000000d071adf.wal is created
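Because the backup is written even though the wrapper reports exit code 141, a quick way to confirm that it exists is to look for WAL segment files under member/wal in the backup directory, as in the log line above. The following is a minimal sketch; the function name backup_has_wal is hypothetical, and the check only confirms that at least one .wal file is present, not that the backup is complete or restorable.

```shell
# Hypothetical helper: succeed if the backup directory contains at least
# one WAL segment file under member/wal.
backup_has_wal() {
  dir=$1
  for f in "$dir"/member/wal/*.wal; do
    # If the glob matches nothing, $f is the literal pattern and -f fails.
    if [ -f "$f" ]; then
      return 0
    fi
  done
  return 1
}

# Path taken from the playbook output; adjust to the actual backup directory.
backup_dir=/var/lib/etcd/openshift-backup-post-3.0-20190902114554
if backup_has_wal "$backup_dir"; then
  echo "WAL segments found in $backup_dir"
else
  echo "no WAL segments found in $backup_dir"
fi
```

Run this on the master host where /var/lib/etcd resides.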
Environment
- Red Hat OpenShift Container Platform
  - 3.9
  - 3.10
  - 3.11