Issue affecting minor updates of Red Hat Ceph Storage 3 can cause OSD corruption

Environment

  • Red Hat Ceph Storage 3
  • Red Hat Enterprise Linux 7
  • Red Hat OpenStack Platform 13

Issue

A known issue affecting minor updates of Red Hat Ceph Storage 3 can cause OSD corruption when OSDs are deployed in containers using ceph-ansible, or when OSDs are deployed by Red Hat OpenStack Platform 13 director:

  • A missing dependency in the Ceph OSD systemd unit file causes abrupt termination of the containers on docker package updates and service restarts.
  • This can lead to service disruption or even data corruption when the docker package is updated in an uncontrolled way on Ceph OSD nodes.

Resolution

Perform the following steps for your director-driven or standalone deployment.

OpenStack director-driven RHCS3 deployments

Before starting the undercloud minor update, update ceph-ansible to a version newer than 3.2.44:

# yum update ceph-ansible
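
Optionally, confirm that the installed version is now newer than 3.2.44 before proceeding (the exact version-release string depends on the repository content):

# rpm -q ceph-ansible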

Run a heat stack update identical to the last execution of the overcloud deploy command, with the same arguments and all of the heat environment files used previously:

# openstack overcloud deploy --templates \
-e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
-e /home/stack/templates/ceph-custom-config.yaml \
-e …
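
When the stack update finishes, verify that it completed successfully before continuing; the overcloud stack status should report UPDATE_COMPLETE (the stack is named overcloud by default, adjust if your deployment uses a different name):

(undercloud) $ openstack stack list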

After the execution, ensure that Requires=docker.service appears in the systemd units of the Ceph OSD containers, for example:

(undercloud) $ ssh heat-admin@overcloud-ceph-0
(overcloud-ceph-0) $ grep Requires /etc/systemd/system/ceph-osd\@.service
Requires=docker.service
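
To check all Ceph Storage nodes in one pass instead of logging in to each one, the ctlplane addresses reported by the undercloud can be iterated over. This is only a sketch: it assumes the default heat-admin user, the default ctlplane network naming, and that the Ceph Storage nodes contain "ceph" in their names.

(undercloud) $ for ip in $(openstack server list --name ceph -f value -c Networks | sed 's/ctlplane=//'); do
    ssh heat-admin@"$ip" 'hostname; grep Requires /etc/systemd/system/ceph-osd@.service'
done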

Standalone RHCS3 deployments

Update the ceph-ansible package:

# yum update ceph-ansible

Refresh the systemd units on the cluster nodes by re-running the site-docker.yml playbook, assuming the group_vars and inventory files created for the initial deployment are still available:

# ansible-playbook site-docker.yml
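
If the playbook is not run from the directory containing those files, the inventory can be passed explicitly. A minimal sketch, assuming the default /usr/share/ceph-ansible installation directory and a placeholder inventory path:

# cd /usr/share/ceph-ansible
# ansible-playbook -i /path/to/inventory site-docker.yml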

After the execution, ensure that Requires=docker.service appears in the systemd units of the Ceph OSD containers on the target nodes.
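
An ad-hoc Ansible command against the osds group of the same inventory can run this check on all OSD nodes at once (a sketch; the inventory path is a placeholder and the group name follows the usual ceph-ansible conventions):

# ansible -i /path/to/inventory osds -m shell -a "grep Requires /etc/systemd/system/ceph-osd@.service"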

Results

After these steps, when all Ceph OSD systemd units have been updated to include the Requires=docker.service line, you can initiate the standard update process for Red Hat Ceph Storage or Red Hat OpenStack Platform.

Root Cause

Red Hat Ceph Storage 3 relies on docker for containerized deployments running on RHEL 7. The ceph-ansible fix for BZ1846830 updates the systemd units controlling the Ceph containers, making them require the docker service to be up and running before they start. This requirement is essential to implement a safe update path and to avoid service disruption or even data corruption on uncontrolled updates of the docker package.

Without the fix, a missing dependency in the Ceph OSD systemd unit file causes abrupt termination of the containers on docker package updates and service restarts.
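
For reference, the [Unit] section of a corrected ceph-osd@.service declares the dependency on the docker service. The following is an illustrative excerpt only; the exact unit file generated by ceph-ansible contains additional directives:

[Unit]
Description=Ceph OSD
Requires=docker.service
After=docker.service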

Updating the ceph-ansible package is not sufficient for the fix to be effective. It is necessary to update the containers' systemd units by rerunning the deployment playbook.

Diagnostic Steps

To verify whether the Ceph cluster can be updated safely, ensure that the line Requires=docker.service appears in the systemd units of the Ceph OSD containers. For example, for director-driven deployments, log in to each node hosting a Ceph OSD container and check whether the desired Requires line is present:

(undercloud) $ ssh heat-admin@overcloud-ceph-0
(overcloud-ceph-0) $ grep Requires /etc/systemd/system/ceph-osd\@.service
Requires=docker.service
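
Alternatively, systemd can report the effective dependencies of an OSD unit instance directly; docker.service should be listed among the required units (the instance number here is an example and the full list of reported units varies by host):

(overcloud-ceph-0) $ systemctl show ceph-osd@0.service -p Requires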

If the output does not show the desired dependency, update the units by following the instructions in the Resolution section.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.