Ceph - Failure to start migrated instance after OpenStack compute crash

Solution Verified - Updated -

Environment

  • Red Hat Ceph Storage
  • Red Hat OpenStack Platform

Issue

  • After a hypervisor crash, a migrated instance backed by Ceph fails to start because the crashed client is still registered as a 'watcher' on the RBD image it was connected to.

Resolution

  • On a Ceph monitor node, check the capabilities of the OpenStack client.
[root@ceph-mon]# ceph auth list
  • If "osd blacklist" is missing from the mon caps, modify the keyring for your OpenStack client so that its mon caps read: mon 'allow r, allow command "osd blacklist"'
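A minimal sketch of that check, assuming the client is named client.openstack as in the rest of this article. The helper name has_blacklist_cap is hypothetical; it only inspects text on standard input, so it can be fed the output of ceph auth list or ceph auth get.

```shell
# has_blacklist_cap (hypothetical helper): succeeds when a "caps mon" line
# read from stdin mentions the "osd blacklist" command.
has_blacklist_cap() {
    awk '/caps mon/ && /osd blacklist/ { found = 1 } END { exit !found }'
}

# Example use on a monitor node (assumes the client is client.openstack):
#   ceph auth get client.openstack | has_blacklist_cap \
#       && echo "osd blacklist present" \
#       || echo "osd blacklist missing"
```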

1) Export the CephX authentication keyring for your OSP environment, and create a backup:

[root@ceph-mon]# ceph auth export client.openstack -o client.openstack.export
[root@ceph-mon]# cp client.openstack.export client.openstack.export.backup

2) Edit client.${name}.export (client.openstack.export in this example) and change the 'caps mon' line to the following:

  • NOTE: You must use double quotes to enclose the value, and use escaped double quotes to enclose osd blacklist
    caps mon = "allow r, allow command \"osd blacklist\""
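If you prefer not to edit the file by hand, the same change can be sketched with sed. This assumes GNU sed (for -i) and the usual export layout with one 'caps mon = ...' line per client; the function name set_mon_blacklist_cap is hypothetical. It replaces the whole mon caps value with the one shown above, so adapt it if your client uses different mon caps.

```shell
# set_mon_blacklist_cap (hypothetical helper): rewrites the "caps mon" line
# of an exported keyring file in place. Run it only after taking a backup.
set_mon_blacklist_cap() {
    # \\" in the sed replacement produces a literal \" in the file.
    sed -i 's/^\([[:space:]]*caps mon = \).*/\1"allow r, allow command \\"osd blacklist\\""/' "$1"
}

# Example:
#   set_mon_blacklist_cap client.openstack.export
#   grep 'caps mon' client.openstack.export
```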

3) After importing the file (step 4), check ceph auth list to confirm that the caps are correct:

GOOD

[root@ceph-mon]# ceph auth list | grep client.openstack -A3
client.openstack
        key: AQAeo/5dAAAAABAAiZTSeas0vRYYeTcYmBeRtA==
        caps mon: allow r, allow command "osd blacklist"
        caps osd: allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=vms, allow rwx pool=images

BAD -- you enclosed the phrase in single quotes.

[root@ceph-mon]# ceph auth list | grep client.openstack -A3
client.openstack
        key: AQAeo/5dAAAAABAAiZTSeas0vRYYeTcYmBeRtA==
        caps mon: 'allow r, allow command "osd blacklist"'
        caps osd: allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=vms, allow rwx pool=images

BAD -- you didn't escape your internal quotes, so the caps mon line is missing entirely.

[root@ceph-mon]# ceph auth list | grep client.openstack -A3
client.openstack
        key: AQAeo/5dAAAAABAAiZTSeas0vRYYeTcYmBeRtA==
        caps osd: allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=vms, allow rwx pool=images

If you get either of the bad results, import your backup before you begin troubleshooting.
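The three outcomes above can be told apart mechanically. The following is a rough sketch, assuming the 'caps mon:' output format shown in the examples; classify_mon_caps is a hypothetical helper name.

```shell
# classify_mon_caps (hypothetical helper): reads the "ceph auth list" entry
# for one client from stdin and reports which of the cases above it matches.
classify_mon_caps() {
    line=$(grep 'caps mon:' | head -n 1)
    case "$line" in
        '')
            echo 'BAD: caps mon line is missing' ;;
        *"caps mon: '"*)
            echo 'BAD: value is wrapped in single quotes' ;;
        *'allow command "osd blacklist"'*)
            echo 'GOOD' ;;
        *)
            echo 'BAD: osd blacklist capability not present' ;;
    esac
}

# Example:
#   ceph auth list | grep client.openstack -A3 | classify_mon_caps
```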

4) Save the file, then import it into Ceph:

[root@ceph-mon]# ceph auth import -i client.openstack.export
  • NOTE1: You can also change the active keyring with the following command, but it is recommended to take an export of the keyring first:
[root@ceph-mon]# ceph auth caps client.<ID> mon 'allow r, allow command "osd blacklist"' osd '<existing osd caps>'
  • Example:
[root@ceph-mon]# ceph auth caps client.cinder mon 'allow r, allow command "osd blacklist"' osd 'allow class-read object_prefix rbd_children, allow rwx pool=cinder, allow rx pool=glance'
  • NOTE2: There is no need to update the keyring files on the Overcloud nodes, because the key itself remains the same.
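Because ceph auth caps replaces all caps for the client, the existing osd caps have to be repeated on the command line. A hedged sketch of how that command could be assembled from an exported keyring follows; print_caps_command is a hypothetical helper, it assumes the osd caps contain no quote characters, and it only prints the command so that it can be reviewed before being run.

```shell
# print_caps_command (hypothetical helper): reads an exported keyring from
# stdin and prints a "ceph auth caps" command that keeps the existing osd
# caps while setting the mon caps from this article.
print_caps_command() {
    entity=$1
    osd_caps=$(sed -n 's/^[[:space:]]*caps osd = "\(.*\)"[[:space:]]*$/\1/p' | head -n 1)
    if [ -z "$osd_caps" ]; then
        echo "no caps osd line found on stdin" >&2
        return 1
    fi
    printf "ceph auth caps %s mon 'allow r, allow command \"osd blacklist\"' osd '%s'\n" \
        "$entity" "$osd_caps"
}

# Example:
#   print_caps_command client.openstack < client.openstack.export.backup
```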

Root Cause

  • In Ceph, clients have the ability to add themselves as a 'watcher' on the RBD they connect to.
  • This 'watcher' state is equivalent to an exclusive lock on the block device: it prevents other clients from making changes that could leave the block device inconsistent.
  • With this capability added to the OpenStack client keyring, the OpenStack nodes can request that the 'watcher' state held by the crashed node's IP address be cleared (blacklisted), so that one of the remaining active nodes can boot the instance.
  • Without this 'osd blacklist' capability, the OpenStack client keyring does not permit this clearing of stale 'watcher' states.
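To see which clients currently hold a watch on an image, rbd status can be run against it. The sketch below assumes that rbd status prints one line per watcher in the form watcher=<address> client.<id> cookie=<n>; the helper name list_watcher_addrs and the pool and image names in the example are hypothetical.

```shell
# list_watcher_addrs (hypothetical helper): prints the address of every
# watcher found in "rbd status" output read from stdin.
list_watcher_addrs() {
    sed -n 's/.*watcher=\([^[:space:]]*\).*/\1/p'
}

# Example (pool and image names are placeholders):
#   rbd status volumes/<image> | list_watcher_addrs
```

An address that belongs to the crashed hypervisor is the stale watcher that the blacklist request is meant to clear.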

There are two bugs reported for RHOSP engineering:

  • BZ#1838145 delivers the fix for Ceph clusters managed by TripleO.
  • BZ#1844360 tracks the documentation fix for RHOSP deployments with external Ceph clusters.

Diagnostic Steps

  • Check the following logs on a monitor node for entries similar to:

/var/log/messages

Mar 22 15:55:24 dev-ceph-01 docker: 2018-03-22 15:55:24.018624 7f7b084aa700  0 log_channel(audit) log [INF] : from='client.? 10.0.0.1:0/2275353734' entity='client.openstack' cmd=[{"prefix": "osd blacklist", "blacklistop": "add", "addr": "10.0.0.2:0/3833716830"}]:  access denied

/var/log/ceph/ceph.audit.log

2018-03-20 17:28:33.246133 mon.dev-ceph-01 mon.0 10.0.0.3:6789/0 6756 : audit [INF] from='client.? 10.0.0.4:0/3936533106' entity='client.openstack' cmd=[{"prefix": "osd blacklist", "blacklistop": "add", "addr": "10.0.0.5:0/619259744"}]:  access denied
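A small filter can pull these entries out of either log. This is a sketch based only on the line format shown above; denied_blacklist_entities is a hypothetical helper name.

```shell
# denied_blacklist_entities (hypothetical helper): reads log lines from
# stdin and prints each entity whose "osd blacklist" command was denied.
denied_blacklist_entities() {
    grep '"prefix": "osd blacklist"' \
        | grep 'access denied' \
        | sed -n "s/.*entity='\([^']*\)'.*/\1/p" \
        | sort -u
}

# Example:
#   denied_blacklist_entities < /var/log/ceph/ceph.audit.log
```

If the OpenStack client (client.openstack in this article) appears in the output, its keyring is missing the capability described in the Resolution section.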


1 Comment

For the export/import process, there shouldn't be single-ticks in the line. This line seems to work: caps mon = "profile rbd, allow command \"osd blacklist\""