Error during cluster upgrade in task "etcd : Generate etcd backup"

Solution In Progress

Issue

During the upgrade of the control plane in a Red Hat OpenShift Container Platform cluster, the playbook fails with the following error:

# ansible-playbook -i inventory  /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_control_plane.yml
...
TASK [etcd : Generate etcd backup] **********************************************************************************************************************************************************
Monday 02 September 2019  11:45:57 +0200 (0:00:00.402)       0:28:19.054 ******
fatal: [master-1.ocpcarc.local]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "backup", "--data-dir=/var/lib/etcd/", "--backup-dir=/var/lib/etcd//openshift-backup-post-3.0-20190902114554"], "delta": "0:00:03.116419", "end": "2019-09-02 11:46:00.941363", "failed": true, "msg": "non-zero return code", "rc": 141, "start": "2019-09-02 11:45:57.824944", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
fatal: [master-0.ocpcarc.local]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "backup", "--data-dir=/var/lib/etcd/", "--backup-dir=/var/lib/etcd//openshift-backup-post-3.0-20190902114554"], "delta": "0:00:03.107044", "end": "2019-09-02 11:46:00.957878", "failed": true, "msg": "non-zero return code", "rc": 141, "start": "2019-09-02 11:45:57.850834", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
fatal: [master-2.ocpcarc.local]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "backup", "--data-dir=/var/lib/etcd/", "--backup-dir=/var/lib/etcd//openshift-backup-post-3.0-20190902114554"], "delta": "0:00:03.902509", "end": "2019-09-02 11:46:01.718962", "failed": true, "msg": "non-zero return code", "rc": 141, "start": "2019-09-02 11:45:57.816453", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

The etcd cluster was checked and reports a healthy status.

Running the same backup command that the playbook uses directly on one of the masters prints only an exit-code message (the directory passed to --backup-dir must not exist or must be empty):

master-0# /usr/local/bin/master-exec etcd etcd etcdctl backup --data-dir=/var/lib/etcd --backup-dir=/var/lib/etcd/openshift-backup-post-3.0-201909271244
command terminated with exit code 141

Exit code 141 corresponds to a broken pipe (128 + 13, where 13 is the signal number of SIGPIPE on Linux), but the backup actually completes successfully and is stored in the specified directory.
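
As an illustration of how an exit status such as 141 can be read, the following is a minimal sketch assuming the usual shell convention that a status above 128 means "terminated by signal (status - 128)". The helper name describe_exit_status is hypothetical and is not part of OpenShift or etcd.

```shell
# Hypothetical helper: interpret an exit status using the common shell
# convention that values above 128 mean "terminated by signal (status - 128)".
describe_exit_status() {
  code=$1
  if [ "$code" -gt 128 ]; then
    signal_number=$((code - 128))
    # kill -l <number> prints the signal name without the SIG prefix.
    echo "terminated by signal ${signal_number} (SIG$(kill -l "${signal_number}"))"
  else
    echo "exited with status ${code}"
  fi
}

# Decode the status reported by the failing backup task.
describe_exit_status 141
```

For status 141 the helper reports signal 13, which is SIGPIPE on Linux. This only decodes the number; it does not say which side of the pipe closed first.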

When the backup is run from inside one of the etcd pods, it completes successfully and without errors, but it prints an unexpected output message, which is what causes exit code 141 in the first command:

# oc exec -ti master-etcd-master-0.example.local -n kube-system  -- /bin/sh

sh-4.2# etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://172.16.10.12:2379 backup --data-dir /var/lib/etcd/ --backup-dir /var/lib/etcd/openshift-backup-post-3.0-20190902114554

2019-09-02 16:36:13.283833 I | wal: segmented wal file /var/lib/etcd/openshift-backup-post-3.0-20190902114554/member/wal/0000000000000001-000000000d071adf.wal is created
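
Because the log line above shows a WAL segment being written under member/wal in the backup directory, one rough way to confirm that a backup was produced despite exit code 141 is to look for such a file. The following is only a sketch: the helper name backup_has_wal is hypothetical, the default path is the one from the example above, and it assumes the backup directory is visible from wherever the check is run. Finding a WAL file suggests the backup ran, but it does not prove the backup is complete or restorable.

```shell
# Hypothetical helper: succeed if the backup directory holds at least one
# WAL segment under member/wal (the layout shown in the etcdctl log line).
backup_has_wal() {
  backup_dir=$1
  for wal_file in "${backup_dir}"/member/wal/*.wal; do
    # If the glob matches nothing, the pattern itself is returned and -f fails.
    if [ -f "${wal_file}" ]; then
      return 0
    fi
  done
  return 1
}

# Assumed path: taken from the example in this article; pass another as $1.
target_dir=${1:-/var/lib/etcd/openshift-backup-post-3.0-20190902114554}

if backup_has_wal "${target_dir}"; then
  echo "WAL segment found in ${target_dir}"
else
  echo "no WAL segment found in ${target_dir}"
fi
```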

Environment

  • Red Hat OpenShift Container Platform
    • 3.9
    • 3.10
    • 3.11
