Ceph - Failure to start migrated instance after hypervisor crash

Solution Verified

Environment

  • Red Hat Ceph Storage 1.3.z and later

Issue

  • After a hypervisor crash, a migrated instance backed by Ceph fails to start, because the crashed client had added itself as a 'watcher' on the RBD image it was connected to and that stale watch blocks the new hypervisor.

Resolution

  • On a Ceph monitor node, check the capabilities of the OpenStack client:
# ceph auth list
  • If "osd blacklist" is missing, modify the client keyring for your OpenStack client so that its mon caps read: allow r, allow command "osd blacklist"
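In the ceph auth list output, a client entry that is missing the capability looks similar to the following (the client name, key, and pool names here are illustrative):

```
client.openstack
        key: AQplaceholderplaceholderplaceholderkey==
        caps: [mon] allow r
        caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=vms
```

The steps below add allow command "osd blacklist" to the [mon] capabilities.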

1: Export the CephX authentication keyring for your OSP environment:

# ceph auth export client.${name} -o client.${name}.export

-NOTE: Replace ${name} with the actual keyring name of your OpenStack client (for example, cinder)

2: Edit client.${name}.export and modify the 'caps mon' line so that it reads:

caps mon = "allow r, allow command \"osd blacklist\""

3: Save the file, then import it into Ceph:

# ceph auth import -i client.${name}.export
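The three steps above can be scripted. A minimal sketch, assuming a client named client.openstack: the ceph export/import commands themselves need a live cluster, so they appear as comments and the sed edit runs against a sample exported keyring:

```shell
# Step 1 (on a monitor node):
#   ceph auth export client.openstack -o client.openstack.export
# A sample of what the export might contain; the key is a placeholder:
printf '[client.openstack]\n\tkey = AQplaceholderplaceholderplaceholder==\n\tcaps mon = "allow r"\n\tcaps osd = "allow rwx pool=vms"\n' > client.openstack.export

# Step 2: rewrite the 'caps mon' line to add the blacklist command capability.
# Exported keyrings escape the inner quotes, hence the \" in the replacement:
sed -i 's|^\([[:space:]]*caps mon = \).*|\1"allow r, allow command \\"osd blacklist\\""|' client.openstack.export

# Step 3 (on a monitor node):
#   ceph auth import -i client.openstack.export
cat client.openstack.export
```

Note that the sed edit leaves the osd caps untouched, which is why this route, unlike ceph auth caps, does not require restating them.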

-NOTE: You can also change the active keyring in place with the following command. Because ceph auth caps overwrites all existing capabilities, the current osd caps must be restated, and it is recommended to take an export of the keyring first:

# ceph auth caps client.<ID> mon 'allow r, allow command "osd blacklist"' osd '<existing osd caps>'

-Example:

# ceph auth caps client.cinder mon 'allow r, allow command "osd blacklist"' osd 'allow class-read object_prefix rbd_children, allow rwx pool=cinder, allow rx pool=glance'
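After either method it is worth confirming the change and, after a failure, checking which client addresses have been blacklisted. A sketch assuming the client.cinder name from the example above; the commands need a reachable cluster, so the sketch prints a hint when the ceph CLI is absent:

```shell
# Confirm the updated caps and list current blacklist entries.
if command -v ceph >/dev/null 2>&1; then
    ceph auth get client.cinder   # 'caps mon' should now include "osd blacklist"
    ceph osd blacklist ls         # addresses currently blacklisted
else
    echo "ceph CLI not available; run these commands on a monitor node"
fi
```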

Root Cause

  • In Ceph, clients have the ability to add themselves as a 'watcher' on the RBD they connect to.
  • This 'watcher' state is equivalent to an exclusive lock on the block device: it prevents other clients from making changes that could leave the block device inconsistent.
  • With the 'osd blacklist' capability added to the OpenStack client keyring, the OpenStack nodes can request that the 'watcher' state from the crashed client's IP address be cleared (the address is blacklisted), so that one of the remaining active nodes can boot the instance.
  • Without the 'osd blacklist' capability, the OpenStack client keyring does not permit clearing these stale 'watcher' states.
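On Ceph releases that provide the rbd status subcommand, the stale watch described above can be inspected directly on the image. A sketch; the pool and image names are illustrative:

```shell
# Show current watchers on an instance's RBD image.
if command -v rbd >/dev/null 2>&1; then
    rbd status vms/instance-00000001_disk
else
    echo "rbd CLI not available; run this on a node with Ceph client tools"
fi
# While a client holds a watch, the output resembles:
#   Watchers:
#       watcher=10.0.0.2:0/3833716830 client.24553 cookie=...
```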

Diagnostic Steps

  • Check the following logs on a monitor node for entries similar to:

/var/log/messages

Mar 22 15:55:24 dev-ceph-01 docker: 2018-03-22 15:55:24.018624 7f7b084aa700  0 log_channel(audit) log [INF] : from='client.? 10.0.0.1:0/2275353734' entity='client.openstack' cmd=[{"prefix": "osd blacklist", "blacklistop": "add", "addr": "10.0.0.2:0/3833716830"}]:  access denied

/var/log/ceph/ceph.audit.log

2018-03-20 17:28:33.246133 mon.dev-ceph-01 mon.0 10.0.0.3:6789/0 6756 : audit [INF] from='client.? 10.0.0.4:0/3936533106' entity='client.openstack' cmd=[{"prefix": "osd blacklist", "blacklistop": "add", "addr": "10.0.0.5:0/619259744"}]:  access denied
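These denials can be found quickly by grepping for refused blacklist attempts. A sketch; the default LOGS paths match the files above and can be overridden:

```shell
# Scan monitor logs for 'osd blacklist' commands that were denied.
LOGS="${LOGS:-/var/log/messages /var/log/ceph/ceph.audit.log}"
grep -h 'blacklistop.*access denied' $LOGS 2>/dev/null \
    || echo "no denied blacklist attempts found in: $LOGS"
```

If any such line appears, the client keyring is missing the 'osd blacklist' capability described in the Resolution section.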

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
