Appendix A. Troubleshooting

A.1. Ansible stops installation because it detects fewer devices than expected

The Ansible automation application stops the installation process and returns the following error:

- name: fix partitions gpt header or labels of the osd disks (autodiscover disks)
  shell: "sgdisk --zap-all --clear --mbrtogpt -- '/dev/{{ item.0.item.key }}' || sgdisk --zap-all --clear --mbrtogpt -- '/dev/{{ item.0.item.key }}'"
  with_together:
    - "{{ osd_partition_status_results.results }}"
    - "{{ ansible_devices }}"
  changed_when: false
  when:
    - ansible_devices is defined
    - item.0.item.value.removable == "0"
    - item.0.item.value.partitions|count == 0
    - item.0.rc != 0

What this means:

When the osd_auto_discovery parameter is set to true in the /usr/share/ceph-ansible/group_vars/osds.yml file, Ansible automatically detects and configures all available devices. During this process, Ansible expects that all OSD nodes use the same devices, and the devices are named in the order in which Ansible detects them. If a device fails on one of the OSD nodes, Ansible cannot detect the failed device and stops the whole installation process.
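
For reference, the relevant setting in the /usr/share/ceph-ansible/group_vars/osds.yml file looks similar to the following minimal sketch; any other options present in your file are omitted here:

# Excerpt from /usr/share/ceph-ansible/group_vars/osds.yml
# When set to true, Ansible scans each OSD node and uses every eligible
# (non-removable, unpartitioned) device it finds as an OSD device.
osd_auto_discovery: true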

Example situation:

  1. Three OSD nodes (host1, host2, host3) use the /dev/sdb, /dev/sdc, and /dev/sdd disks.
  2. On host2, the /dev/sdc disk fails and is removed.
  3. Upon the next reboot, Ansible fails to detect the removed /dev/sdc disk and expects that only two disks are used on host2: /dev/sdb and /dev/sdc (the remaining disk, formerly named /dev/sdd).
  4. Ansible stops the installation process and returns the above error message.

To fix the problem:

In the /etc/ansible/hosts file, explicitly specify the devices used by the OSD node with the failed disk (host2 in the example situation above):

[osds]
host1
host2 devices="[ '/dev/sdb', '/dev/sdc' ]"
host3
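
Optionally, to verify which devices Ansible detects on each OSD node before rerunning the playbook, you can query the ansible_devices facts with an ad hoc command similar to the following; the osds group name matches the inventory above:

ansible osds -m setup -a 'filter=ansible_devices'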

See Chapter 5, Installing Red Hat Ceph Storage using Ansible, for details.