Ceph: waiting for the monitor(s) to form the quorum

I am trying to install rhceph using Cockpit or ansible-playbook. In every case, the installation stops at the task [ceph-mon: waiting for the monitor(s) to form the quorum...]. I have tried both the container-based and the RPM-based installation, and both stop at this task. The containers are running. I tried to run the command that this task executes by hand, but nothing happens. What should the result be, and why is this task also performed during an RPM-based installation?

[root@ceph-node1]# podman ps
CONTAINER ID  IMAGE                                                  COMMAND               CREATED            STATUS                PORTS  NAMES
35aea4118ac5  registry.redhat.io/rhceph/ansible-runner-rhel8:latest  /usr/bin/supervis...  About an hour ago  Up About an hour ago         runner-service
15c899b010dc  registry.redhat.io/rhceph/rhceph-4-rhel8:latest                              About an hour ago  Up About an hour ago         ceph-mon-ceph-node1
[root@ceph-node1]# cat /usr/share/ceph-ansible/roles/ceph-mon/tasks/ceph_keys.yml
---
- name: waiting for the monitor(s) to form the quorum...
  command: >
    {{ container_exec_cmd }}
    ceph
    --cluster {{ cluster }}
    -n mon.
    -k /var/lib/ceph/mon/{{ cluster }}-{{ ansible_hostname }}/keyring
    mon_status
    --format json
  register: ceph_health_raw
  run_once: true
  until: >
    (ceph_health_raw.stdout | length > 0) and (ceph_health_raw.stdout | default('{}') | from_json)['state'] in ['leader', 'peon']
  retries: "{{ handler_health_mon_check_retries }}"
  delay: "{{ handler_health_mon_check_delay }}"
  changed_when: false

When I try to run the command by hand:

podman exec ceph-mon-ceph-node1 ceph --cluster ceph -n mon. -k /var/lib/ceph/mon/ceph-ceph-node1/keyring mon_status --format json

the output is empty.
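
For reference, the `until` condition in the task above only passes when the command prints JSON whose "state" is "leader" or "peon", so empty output can never satisfy it. Here is a minimal Python sketch of that test, assuming the command output has been captured as a string. The function name mon_in_quorum is made up for illustration and is not part of ceph-ansible.

```python
import json


def mon_in_quorum(stdout: str) -> bool:
    """Approximate the playbook's `until` test on mon_status output.

    Returns True only if stdout is non-empty JSON whose "state"
    field is "leader" or "peon".
    """
    if not stdout.strip():
        # Empty output: the monitor did not answer at all.
        return False
    try:
        status = json.loads(stdout)
    except json.JSONDecodeError:
        # Output that is not valid JSON cannot pass the check either.
        return False
    return isinstance(status, dict) and status.get("state") in ("leader", "peon")


if __name__ == "__main__":
    print(mon_in_quorum(""))                      # empty output, as in this post
    print(mon_in_quorum('{"state": "probing"}'))  # monitor up but not in quorum
    print(mon_in_quorum('{"state": "leader"}'))   # monitor in quorum
```

With the empty output shown above the function returns False, which is why the task keeps retrying until its retries run out.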

Responses

same issue

Hi, same issue on my setup. Any updates? Thanks.

Hi, I have the same issue.

I have the same issue...

RPM-based installation (EL8), Ceph: Octopus (15.2.1), ceph-ansible: stable-5.0

Hi, I learned from Red Hat support that the problem depends on the monitoring network. The monitoring network should be on the public network. If you set the monitoring network on the public network, the installation should work correctly. Good luck.

If you set monitoring network on public network

What do you mean if you set monitoring network on public? We don't set the monitoring network, it isn't an option during installation. The cockpit installer doesn't give us an option to pick the monitoring network. How do we do this?

Yes, the web UI has only three options: cluster network, public network and S3 network.

Just for testing, could you use the same network for the cluster network and the public network, then try again? You can also check /usr/share/ceph-ansible/group_vars/all.yml, where you will find them set like this:

public_network: 10.34.0.0/16
monitor_address_block: 10.34.0.0/16

This is what caused my initial installation to fail: all.yml had my monitor network on the cluster network IP block.

I have the same issue.

Was there a resolution to this issue? I am also encountering the same problem with ceph-ansible.

Did you check that the monitor network is the same as the public network? You can find both in /usr/share/ceph-ansible/group_vars/all.yml.

Ultimately I had to do the following to fix "Ceph: waiting for the monitor(s) to form the quorum". As my first failed install had left mon containers running on my cluster nodes, I had to do some cleanup before I could continue. I was installing on Red Hat Enterprise Linux 7.7.

1) I stopped the Docker services on the cluster nodes, but not on the Ansible Administrator node.

2) I uninstalled Docker and removed all Docker files on the cluster nodes.

sudo yum remove docker-client docker-common python-docker-pycreds docker-rhel-push-plugin python-docker-py
sudo rm -f -r /var/lib/docker /etc/docker /usr/bin/docker* /root/.docker /usr/libexec/docker /etc/sysconfig/docker-storage.rpmsave

3) I then purged the previous installation by running the below commands on the Ansible Administrator node.

cd /usr/share/ceph-ansible
cp infrastructure-playbooks/purge-container-cluster.yml /usr/share/ceph-ansible
ansible-playbook purge-container-cluster.yml

4) I then went through the cockpit installation again. At the point where you click save, it writes the /usr/share/ceph-ansible/group_vars/all.yml file.

5) Before continuing with the deploy, I opened this file with a text editor and set the monitor_address_block to be the same as the public_network.

monitor_address_block: 172.19.13.0/24
node_exporter_container_image: registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.1
prometheus_container_image: registry.redhat.io/openshift4/ose-prometheus:4.1
public_network: 172.19.13.0/24

6) After making this change I clicked Deploy in the Cockpit installer and the installation began, quickly passing the mon setup phase where I had previously been stuck. Within a few minutes I had finally finished my first Ceph installation.

I hope this helps anyone else having this problem. It took almost a week of working with Red Hat support to find it.
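
The check in step 5 can be sketched in Python roughly as follows. This is only an illustration and not part of ceph-ansible: the function names are hypothetical, it reads simple top-level `key: value` lines rather than parsing YAML properly, and it assumes each setting holds a single CIDR block.

```python
import ipaddress
import re

WANTED_KEYS = ("public_network", "monitor_address_block")


def read_network_settings(text: str) -> dict:
    """Pick the two network settings out of all.yml-style text.

    Assumption: each setting sits on its own unindented line,
    e.g. "public_network: 172.19.13.0/24".
    """
    found = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s*(\S+)", line)
        if match and match.group(1) in WANTED_KEYS:
            # Drop optional surrounding quotes from the value.
            found[match.group(1)] = match.group(2).strip("'\"")
    return found


def monitor_block_matches_public(text: str) -> bool:
    """True if both settings are present and name the same network."""
    settings = read_network_settings(text)
    if any(key not in settings for key in WANTED_KEYS):
        return False
    public = ipaddress.ip_network(settings["public_network"], strict=False)
    monitor = ipaddress.ip_network(settings["monitor_address_block"], strict=False)
    return public == monitor


def check_file(path: str = "/usr/share/ceph-ansible/group_vars/all.yml") -> bool:
    """Run the comparison against the file written by the Cockpit installer."""
    with open(path, encoding="utf-8") as handle:
        return monitor_block_matches_public(handle.read())
```

Calling check_file() on the Ansible Administrator node after clicking save, and before clicking Deploy, would flag the mismatch that caused my first install to hang.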

Hi Michael, thanks for your detailed reply and notes. I have run through the commands you suggested but unfortunately still end up in the same position, with the playbooks stuck at "waiting for quorum". Have you tried deploying ceph-ansible across separate physical nodes, as I am doing? In that setup, physical node #1 hosts one mon and an OSD service, and so on. I am also running RHEL7 hosts but deploying RHEL8 (ceph4-rhel8) containers, following the Red Hat installation instructions. Do I need to set anything in particular in the Docker configuration to enable communication between a monitor container on node #1 and a monitor container on node #2?

I'm clutching at straws a bit here and have spent a number of days trying to get this working.

One thing to note is that if I "exec" into the ceph4-rhel8 container and run "ceph -s", the command hangs. Even running "ceph -h" has problems and times out without completing.

Any help is appreciated.
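
To narrow down whether the monitors can reach each other at all, a basic TCP reachability test between the nodes might look like the Python sketch below. It assumes the monitors listen on the Ceph default ports 3300 (msgr2) and 6789 (msgr1). The function names and the host name ceph-node2 are placeholders, and an open port only shows that something is listening, not that the monitor is healthy.

```python
import socket

# Ceph monitor default ports: 3300 (msgr2) and 6789 (msgr1).
MON_PORTS = (3300, 6789)


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_monitors(hosts, ports=MON_PORTS, timeout: float = 3.0) -> dict:
    """Map each (host, port) pair to whether it accepted a connection."""
    return {
        (host, port): port_open(host, port, timeout)
        for host in hosts
        for port in ports
    }


if __name__ == "__main__":
    # Placeholder host names; replace with the monitor nodes' public addresses.
    for target, reachable in check_monitors(["ceph-node1", "ceph-node2"]).items():
        print(target, "reachable" if reachable else "NOT reachable")
```

If node #1 cannot reach these ports on node #2, the problem is likely in the firewall or the container networking rather than in Ceph itself.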