Chapter 6. Known issues

This section documents known issues found in this release of Red Hat Ceph Storage.

6.1. Ceph Ansible

The shrink-osd.yml playbook currently has no support for removing OSDs created by ceph-volume

The shrink-osd.yml playbook assumes all OSDs are created by ceph-disk. As a result, OSDs deployed using ceph-volume cannot be shrunk.

As a workaround, OSDs deployed using ceph-volume can be removed manually.
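A minimal sketch of the manual removal, run as root, assuming the OSD to remove is osd.0 and its data device is /dev/sdb (both hypothetical examples; the ceph osd purge command is available as of Luminous):

```shell
# Mark the OSD out so data migrates off it, then stop its service
ceph osd out osd.0
systemctl stop ceph-osd@0

# Remove the OSD from the CRUSH map, authentication keys, and the OSD map
ceph osd purge osd.0 --yes-i-really-mean-it

# Zap the underlying device so it can be reused (destroys its data)
ceph-volume lvm zap /dev/sdb --destroy
```

Wait for the cluster to finish rebalancing after marking the OSD out before purging it.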

(BZ#1569413)

The container does not restart on option changes

When changing an option, for example, ceph_osd_docker_memory_limit, the change will not trigger a restart of the container.

To work around this issue, restart the container manually.
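For example, after changing ceph_osd_docker_memory_limit and rerunning the playbook, the affected container can be restarted through its systemd unit; a sketch, run as root, assuming a containerized OSD managed by the hypothetical unit ceph-osd@0:

```shell
# Restart the OSD container so the changed option takes effect
# (ceph-osd@0 is a hypothetical example; substitute your daemon's unit)
systemctl restart ceph-osd@0
```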

(BZ#1596061)

Purging the cluster will try to unmount a partition from /var/lib/ceph

If you mount a partition to /var/lib/ceph, running the purge playbook will cause a failure when it tries to unmount it.

To work around this issue, do not mount a partition to /var/lib/ceph.

(BZ#1615872)

When putting a dedicated journal on an NVMe device, installation can fail

If dedicated_devices contains an NVMe device that has partitions or signatures on it, the Ansible installation might fail with an error like the following:

journal check: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected c325f439-6849-47ef-ac43-439d9909d391, invalid (someone else's?) journal

To work around this issue, ensure there are no partitions or signatures on the NVMe device.
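One way to clear the device before deployment, run as root, assuming /dev/nvme0n1 is the intended journal device (a hypothetical example):

```shell
# WARNING: these commands destroy all data on the device
# (/dev/nvme0n1 is a hypothetical example; substitute your journal device)
wipefs --all /dev/nvme0n1
sgdisk --zap-all /dev/nvme0n1
```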

(BZ#1619090)

Running the Ansible playbook, purge-iscsi-gateways.yml does not stop and disable the iSCSI gateway services

When purging the Ceph iSCSI gateways using Ceph Ansible, the iSCSI gateway services are still running. You must manually stop and disable these services by doing the following as root:

# systemctl stop rbd-target-api
# systemctl stop rbd-target-gw
# systemctl stop tcmu-runner

# systemctl disable rbd-target-api
# systemctl disable rbd-target-gw
# systemctl disable tcmu-runner

If you are using the gwcli command to manage the iSCSI gateways, then do not stop or disable these services.

(BZ#1621255)

6.2. Ceph Dashboard

The 'iSCSI Overview' page does not display correctly

When using the Red Hat Ceph Storage Dashboard, the 'iSCSI Overview' page does not display any graphs or values as expected.

(BZ#1595288)

Ceph OSD encryption summary is not displayed in the Red Hat Ceph Storage Dashboard

On the Ceph OSD Information dashboard, under the OSD Summary panel, the OSD Encryption Summary information is not displayed. Currently, there is no workaround for this issue.

(BZ#1605241)

The Prometheus node-exporter service is not removed after doing a purge

When doing a purge of the Red Hat Ceph Storage Dashboard, the node-exporter service is not removed and is still running. To work around this issue, you must manually stop and remove the node-exporter service.

Do the following as root:

# systemctl stop prometheus-node-exporter
# systemctl disable prometheus-node-exporter
# rpm -e prometheus-node-exporter
# reboot

Reboot Ceph Monitor, OSD, Object Gateway, MDS, and Dashboard nodes one at a time.

(BZ#1609713)

The OSD node details are not displayed in the Host OSD Breakdown panel

In the Red Hat Ceph Storage Dashboard, the Host OSD Breakdown information is not displayed on the OSD Node Detail panel under All.

(BZ#1610876)

Red Hat Ceph Storage Dashboard does not reflect correct OSDs

Currently, in the Ceph Cluster dashboard, the Cluster Configuration tab can show the wrong number of OSDs in some situations. To work around this issue, open the Ceph OSD Information dashboard and view the OSD Summary tab for the correct number of OSDs.

(BZ#1627725)

6.3. ceph-volume Utility

Using custom storage cluster names fails to start OSDs

When using a custom storage cluster name other than ceph, the OSDs might not start after a reboot.

To work around this issue, either do not use custom names when creating a new storage cluster, or create a symbolic link with the same name as the default configuration file name (/etc/ceph/ceph.conf) pointing to the custom named configuration file:

# mv /etc/ceph/ceph.conf /etc/ceph/ceph.conf.backup
# ln -s /etc/ceph/<custom-name>.conf /etc/ceph/ceph.conf

As a result, the OSDs will start properly.

(BZ#1621901)

6.4. iSCSI Gateway

Using Ceph Ansible to deploy the iSCSI gateway does not allow the user to adjust the max_data_area_mb option

Deploying the iSCSI gateway with Ceph Ansible sets the max_data_area_mb option to its default value of 8 MB, regardless of the value specified. To adjust this value, you must set it manually using the gwcli command. See the Red Hat Ceph Storage Block Device Guide for details on setting the max_data_area_mb option.
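A sketch of adjusting the value from the gwcli interactive shell, assuming a hypothetical image disk_1 in the rbd pool; consult the Red Hat Ceph Storage Block Device Guide for the authoritative syntax:

```shell
# From within the gwcli shell (pool and image names are hypothetical)
# cd /disks/
# reconfigure rbd/disk_1 max_data_area_mb 64
```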

(BZ#1613826)

An iSCSI device is busy according to the systemd-udevd service

In Red Hat Enterprise Linux 7.5, the kernel’s ALUA layer reduced the number of times an initiator retries the SCSI sense code ALUA State Transition. This code is returned from the target side by the tcmu-runner service when taking the RBD exclusive lock during a failover or failback scenario and when doing device discovery. As a consequence, the maximum number of retries is reached before the discovery process completes, and the SCSI layer returns a failure to the multipath IO layer. The multipath IO layer then tries the next available path, where the same problem occurs. This causes a loop of path checking, resulting in failed IO and failed management operations on the multipath device. The logs on the initiator node print messages about devices being removed and then re-added. To work around this issue, downgrade the initiator’s kernel to Red Hat Enterprise Linux 7.4.

(BZ#1623601)

Rebooting an iSCSI initiator with connected devices leads to an error

During device and path setup, the initiator sends commands to all paths at the same time. This causes the Ceph iSCSI gateways to take the RBD lock from one device and set it on another device. In some cases the iSCSI gateway interprets the lock being taken away in this manner as a hard error and escalates its error handler by dropping the iSCSI connection, reopening the RBD devices to clear old state, and then enabling the iSCSI target port group to allow a new iSCSI connection. Disabling and enabling the iSCSI target port group disrupts device and path discovery. As a result, the multipath IO layer continually disables and enables all paths and IO is suspended, or device and path discovery fails and the device is not set up. Currently, there is no workaround for this issue.

(BZ#1623650)

6.5. Object Gateway

The Ceph Object Gateway requires applications to write sequentially

The Ceph Object Gateway requires applications to write sequentially from offset 0 to the end of a file. Attempting to write out of order causes the upload operation to fail. To work around this issue, use utilities like cp, cat, or rsync when copying files into NFS space. Always mount with the sync option.
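A sketch of the workaround, run as root, assuming a hypothetical NFS export server:/export mounted at /mnt/nfs:

```shell
# Mount the NFS export with synchronous writes
# (server, export, and mount point are hypothetical examples)
mount -t nfs -o sync server:/export /mnt/nfs

# Copy files with a sequential writer such as cp
cp largefile /mnt/nfs/
```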

(BZ#1492589)

RGW garbage collection fails to keep pace during evenly balanced delete-write workloads

In testing, during an evenly balanced delete-write (50% / 50%) workload, Object Gateway garbage collection fails to keep pace, the cluster fills completely in eleven hours, and the status switches to the HEALTH_ERR state. Aggressive settings for the new parallel/async garbage collection tunables significantly delayed the onset of cluster fill in testing and can be helpful for many workloads. Typical real-world cluster workloads are not likely to cause a cluster fill due primarily to garbage collection.

(BZ#1595833)

RGW garbage collection decreases client performance by up to 50% during mixed workload

In testing, during a mixed workload of 60% reads, 16% writes, 14% deletes, and 10% lists, client throughput and bandwidth dropped to half their earlier levels 18 hours into the testing run.

(BZ#1596401)

Large objects handled incorrectly on versioned swift containers

When uploading large objects to versioned Swift containers, use the --leave-segments option with python-swiftclient. Omitting this option causes the manifest file to be overwritten, which overwrites an existing object and leads to data loss.
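For example, with the python-swiftclient command-line tool, assuming a hypothetical container and file name:

```shell
# Upload a large object in segments without rewriting the existing manifest
# (container and file names are hypothetical; segment size is 1 GB)
swift upload --segment-size 1073741824 --leave-segments mycontainer largefile
```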

(BZ#1601876)

6.6. RADOS

High object counts can degrade IO performance

The overhead of directory merging on FileStore can degrade client IO performance for pools with high object counts.

To work around this issue, use the expected_num_objects option during pool creation. Creating pools is described in the Red Hat Ceph Storage Object Gateway for Production Guide.
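A sketch of creating a pre-split pool, assuming hypothetical values for the pool name, placement group counts, CRUSH rule, and expected object count:

```shell
# Create a replicated pool pre-split for an expected one million objects
# (mypool, 64/64 PGs, replicated_rule, and 1000000 are hypothetical examples)
ceph osd pool create mypool 64 64 replicated replicated_rule 1000000
```

Pre-splitting directories at creation time avoids the merge/split overhead later, when the pool is already under load.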

(BZ#1592497)

When two or more RADOS Gateway daemons have the same name in a cluster Ceph Manager can crash

Currently, Ceph Manager can crash if two or more RADOS Gateway daemons have the same name. In this case, the following assert is generated:

DaemonPerfCounters::update(MMgrReport*)

To work around this issue, rename the RADOS Gateway daemons so that each daemon has a unique name.
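For example, naming each daemon after its host keeps the names unique across the cluster; a hypothetical ceph.conf sketch (host names, port, and frontend settings are examples, not values from this release):

```
# Each RADOS Gateway instance gets a unique section name
[client.rgw.host1]
    host = host1
    rgw frontends = civetweb port=8080

[client.rgw.host2]
    host = host2
    rgw frontends = civetweb port=8080
```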

(BZ#1634964)