Issue affecting minor updates of Red Hat Ceph Storage 3 can cause OSD corruption


Environment

  • Red Hat Ceph Storage (RHCS) 3
  • Red Hat Enterprise Linux (RHEL) 7
  • Red Hat OpenStack Platform (RHOSP) 13

Issue

A known issue affecting minor updates of Red Hat Ceph Storage 3 can cause OSD corruption when OSDs are deployed in containers using ceph-ansible or using Red Hat OpenStack Platform 13 director.

  • A missing dependency in the Ceph OSD systemd unit file causes abrupt termination of the containers on docker package updates and service restarts.
  • The result is service disruption and potential data corruption when the docker package is updated in an uncontrolled way on Ceph OSD nodes.

Resolution

Perform the following steps for your director-driven or standalone deployment.

RHOSP 13 director-driven RHCS 3 deployments

Run the openstack overcloud ceph-upgrade command to update the containerized RHCS 3 cluster before running the RHOSP 13 overcloud update.

  1. Complete the undercloud update.
  2. Make sure that the ceph-ansible package version on the undercloud is 3.2.52 or later:
$ rpm -q ceph-ansible
  3. Complete all steps from Keeping Red Hat OpenStack Platform Updated up to, but not including, 4.4. Updating all Controller nodes.
  4. Run the Ceph Storage update command. For example:
$ openstack overcloud ceph-upgrade run \
--templates \
-e <ENVIRONMENT FILE> \
-e /home/stack/templates/overcloud_images.yaml \
-e /home/stack/templates/updates-environment.yaml
  5. After the command completes, ensure that Requires=docker.service appears in the systemd units of the Ceph OSD containers. For example:
(undercloud) $ ssh heat-admin@overcloud-ceph-0
(overcloud-ceph-0) $ grep Requires /etc/systemd/system/ceph-osd\@.service
Requires=docker.service
  6. Continue with the overcloud update from step 4.4. Updating all Controller nodes.
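
The version check in step 2 can be scripted instead of read by eye. The following is a minimal sketch, assuming GNU sort with the -V (version sort) option is available, as it is on RHEL 7. The function names version_ge and check_ceph_ansible are hypothetical helpers and are not part of any Red Hat tooling.

```shell
#!/bin/sh
# Minimum ceph-ansible version named in this article.
MIN_VERSION=3.2.52

# version_ge A B: succeed if dotted version A is greater than or equal to B.
# Relies on GNU "sort -V": if B sorts first (or the two are equal), A >= B.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Hypothetical wrapper: compare the installed package against MIN_VERSION.
# Returns 0 if new enough, 1 if too old, 2 if the package is not installed.
check_ceph_ansible() {
    installed=$(rpm -q --queryformat '%{VERSION}' ceph-ansible) || {
        echo "ceph-ansible is not installed" >&2
        return 2
    }
    if version_ge "$installed" "$MIN_VERSION"; then
        echo "ceph-ansible $installed is new enough"
    else
        echo "ceph-ansible $installed is older than $MIN_VERSION" >&2
        return 1
    fi
}
```

Running check_ceph_ansible on the undercloud before the ceph-upgrade step gives a pass or fail exit status that a wrapper script can act on.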

Standalone RHCS 3 deployments

  1. Update the ceph-ansible package on your deployment node:
# yum update ceph-ansible
  2. Refresh the systemd units on the cluster nodes by re-running the site-docker.yml playbook. Ensure that the group_vars and inventory files created for the initial deployment are still available:
# ansible-playbook site-docker.yml
  3. After the playbook completes, ensure that Requires=docker.service appears in the systemd units of the Ceph OSD containers on the storage nodes:
$ grep Requires /etc/systemd/system/ceph-osd\@.service
Requires=docker.service
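
The grep check above can be wrapped so that it returns an exit status suitable for scripting. This is a minimal sketch under stated assumptions: the function name unit_file_requires_docker is hypothetical, and the pattern is anchored to the start of the line so that a commented-out "#Requires=docker.service" does not count as a match, which a plain "grep Requires" would report.

```shell
#!/bin/sh
# unit_file_requires_docker [FILE]
# Succeed if FILE contains an active Requires= line that lists docker.service.
# FILE defaults to the OSD template unit path used in this article.
unit_file_requires_docker() {
    file=${1:-/etc/systemd/system/ceph-osd@.service}
    # Match "Requires=docker.service" alone or among other space-separated units.
    grep -Eq '^Requires=(.*[[:space:]])?docker\.service([[:space:]]|$)' "$file"
}
```

For example, on a storage node: unit_file_requires_docker && echo "unit file updated" || echo "unit file NOT updated".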

Results

After you complete the steps above and verify that all Ceph OSD systemd units include the Requires=docker.service line, you can start the standard update process for Red Hat Ceph Storage.

Root Cause

RHCS 3 relies on docker for containerized deployments running on RHEL 7. The ceph-ansible fix for BZ1846830 updates the systemd units that control the Ceph containers so that they require the docker service to be up and running. This requirement is essential for a safe update path: it avoids service disruption and potential data corruption when the docker package is updated in an uncontrolled way.

A missing dependency in the Ceph OSD systemd unit file causes abrupt termination of the containers on docker package updates and service restarts.

Updating the ceph-ansible package is not sufficient for the fix to be effective. It is necessary to update the containers' systemd units by rerunning the deployment playbook.
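
With Requires=, systemd stops or restarts the requiring unit whenever the required unit is explicitly stopped or restarted, which is what lets the OSD units shut down in an orderly way around a docker restart. Since the fix only takes effect once the regenerated units are in place, it can be useful to ask systemd what it has actually loaded rather than only reading the file. The following is a minimal sketch, assuming that "systemctl show -p Requires" prints a single "Requires=" line of space-separated unit names. The function names and the instance name in the usage note are hypothetical.

```shell
#!/bin/sh
# requires_lists_docker LINE
# Succeed if a "Requires=unit1 unit2 ..." property line lists docker.service.
requires_lists_docker() {
    # Strip the "Requires=" prefix and pad with spaces so whole names match.
    case " ${1#Requires=} " in
        *" docker.service "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: query the loaded properties of one OSD unit instance.
loaded_unit_requires_docker() {
    requires_lists_docker "$(systemctl show -p Requires "$1")"
}
```

For example, loaded_unit_requires_docker ceph-osd@0.service, where the instance name after the @ depends on how the OSDs were deployed.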

Diagnostic Steps

To verify whether the Ceph cluster can be updated safely, ensure that the line Requires=docker.service appears in the systemd units of the Ceph OSD containers. For example, for director-driven deployments, log in to every node that hosts a Ceph OSD container and inspect the systemd unit file:

(undercloud) $ ssh heat-admin@overcloud-ceph-0
(overcloud-ceph-0) $ grep Requires /etc/systemd/system/ceph-osd\@.service
(overcloud-ceph-0) $

If the output does not show the Requires=docker.service line, as in the example above, it is essential to update the systemd unit file by following the instructions in the Resolution section.
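
Because every OSD node has to be checked, the inspection can be looped over a list of hosts. This is a minimal sketch, assuming passwordless SSH as the heat-admin user from the undercloud. The function name nodes_missing_requires and the SSH_CMD and SSH_USER variables are hypothetical, and any host name other than overcloud-ceph-0 is illustrative.

```shell
#!/bin/sh
UNIT_FILE='/etc/systemd/system/ceph-osd@.service'

# nodes_missing_requires: read host names from stdin, one per line, and print
# each host whose OSD unit file has no Requires= line naming docker.service.
# A host that cannot be reached is also printed, because the remote grep fails.
# SSH_CMD may be overridden (for example, for testing); it defaults to ssh.
nodes_missing_requires() {
    while IFS= read -r host; do
        [ -n "$host" ] || continue
        # </dev/null keeps ssh from consuming the remaining host list on stdin.
        if ! "${SSH_CMD:-ssh}" "${SSH_USER:-heat-admin}@$host" \
            "grep -q '^Requires=.*docker\.service' '$UNIT_FILE'" </dev/null; then
            printf '%s\n' "$host"
        fi
    done
}
```

For example: printf 'overcloud-ceph-0\novercloud-ceph-1\n' | nodes_missing_requires. Empty output means that every listed node already has the dependency in place.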

