Swift fails on one controller because the filesystem is full of quarantined files

Solution Verified - Updated -

Environment

  • Red Hat OpenStack Platform 16.2.4

Issue

Why is the service tripleo_swift_rsync_healthcheck not running on this controller, failing with the message "No space left on device"?

  • The swift list command on this controller returns no objects.
  • The Swift quarantine directory uses most of the space on the device.
  • Swift is not syncing on this controller because the filesystem is full.

Resolution

This procedure focuses on removing unnecessary files so that the Swift services can be restarted, and on troubleshooting why the objects are being quarantined.

  1. Confirm that the Swift ring is working correctly on the other 2 controllers by checking the md5sum of the ring and of the Swift configuration. If it is not, this needs to be solved first through a support case, and this procedure is not applicable.
(controller) $ sudo systemctl list-units 'tripleo_swift*'
(controller) $ sudo podman exec -it -u swift swift_object_server /usr/bin/swift-recon --all

The output should report that 3 of 3 hosts matched with 0 errors, as shown below:

[2023-03-23 16:40:53] Checking ring md5sums
3/3 hosts matched, 0 error[s] while checking hosts.
===============================================================================
[2023-03-23 16:40:53] Checking swift.conf md5sum
3/3 hosts matched, 0 error[s] while checking hosts.
  1. Identify whether old objects are quarantined. Files that have been in quarantine for more than one month can be safely removed.
(controller) $ sudo find /srv/node/sdb/quarantined/objects/ -mtime +31 -type f
(controller) $ sudo rm -R /srv/node/sdb/quarantined/objects/<directory of the file inside objects>
  1. Identify whether duplicate objects are quarantined (duplicates have a "-" after the ID).
(controller) $ cd /srv/node/sdb/quarantined/objects/
(controller) $ ls -d *-* | wc -l
  1. Compare at least 2 quarantined objects with their duplicates to confirm that their content is identical.
(controller) $ sudo podman exec -it -u swift swift_object_server swift-object-info <object path>
(controller) $ sudo podman exec -it -u swift swift_object_server swift-object-info <duplicated object path>
(controller) $ md5sum <object path>
(controller) $ md5sum <duplicated object path>
  1. Stop all Swift services on the controller.
(controller) $ sudo systemctl stop $(systemctl list-units --no-legend --plain --all 'tripleo_swift*' | awk '{print $1}')
  1. Delete the duplicates only.
(controller) $ sudo find /srv/node/sdb/quarantined/objects/ -mindepth 1 -maxdepth 1 -name '*-*' -exec rm -R {} +
  1. Restart all Swift services on the controller.
(controller) $ sudo systemctl start $(systemctl list-units --no-legend --plain --all 'tripleo_swift*' | awk '{print $1}')
  1. Monitor the Swift logs and the filesystem for at least 30 minutes.
(controller) $ tail -f /var/log/containers/swift/swift.log
(controller) $ df
  1. If the issue is not solved, the logs can show what is causing the objects to be quarantined. If needed, enable debug logging for Swift and open a support case.
(controller) $ sudo crudini --set /var/lib/config-data/puppet-generated/swift/etc/swift/swift.conf DEFAULT debug true
(controller) $ sudo systemctl restart $(systemctl list-units --no-legend --plain --all 'tripleo_swift*' | awk '{print $1}')
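
Before deleting anything, it can help to preview what the cleanup steps in this procedure would match. The following is a minimal sketch of such a dry run, not part of the supported procedure. The function names are hypothetical, and QUARANTINE_DIR defaults to the quarantine path used in this article, which may differ on other devices. The script only lists and counts entries and never deletes anything.

```shell
#!/bin/sh
# Dry-run preview of the quarantine cleanup. Nothing is deleted here.
# QUARANTINE_DIR is an assumption; override it to match the affected device.
QUARANTINE_DIR="${QUARANTINE_DIR:-/srv/node/sdb/quarantined/objects}"

# List top-level quarantine entries whose name contains "-" (the duplicates).
list_duplicates() {
    find "$1" -mindepth 1 -maxdepth 1 -name '*-*' -print
}

# List regular files older than the given number of days (default: 31).
list_old_files() {
    find "$1" -type f -mtime "+${2:-31}" -print
}

# Print how many duplicates and old files were found under a directory.
summarize_quarantine() {
    echo "duplicates: $(list_duplicates "$1" | wc -l | tr -d ' ')"
    echo "old_files: $(list_old_files "$1" | wc -l | tr -d ' ')"
}

# Only report when the directory exists on this host.
if [ -d "$QUARANTINE_DIR" ]; then
    summarize_quarantine "$QUARANTINE_DIR"
fi
```

Run the script as a user that can read the quarantine directory, for example with sudo. If the counts look unexpected, review the output of list_duplicates and list_old_files before running any rm command.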

Diagnostic Steps

  1. Check the usage of the controller's filesystems.
(controller) $ df
  1. Check the Swift quarantine directory size.
(controller) $ du -sh /srv/node/sdb/quarantined/objects/
  1. Check the service tripleo_swift_rsync_healthcheck.
(controller) $ systemctl status tripleo_swift_rsync_healthcheck.service
  1. Check Swift sync logs.
(controller) $ grep 'No space left on device' /var/log/containers/swift/swift.log
(controller) $ grep 'rsync error: error in file IO' /var/log/containers/swift/swift.log
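
The diagnostic checks above can be gathered into one short report. This is a minimal sketch under stated assumptions: the function names are hypothetical, SWIFT_LOG defaults to the log path used in this article, and the usage value is taken from the fifth column of the POSIX output format of df.

```shell
#!/bin/sh
# Summarize filesystem usage and the Swift log errors named in this article.
# SWIFT_LOG is an assumption; override it if the log lives elsewhere.
SWIFT_LOG="${SWIFT_LOG:-/var/log/containers/swift/swift.log}"

# Count log lines containing a fixed string; print 0 if the file is unreadable.
count_matches() {
    if [ -r "$2" ]; then
        grep -cF -- "$1" "$2" || true
    else
        echo 0
    fi
}

# Print the usage percentage (without "%") of the filesystem holding a path.
fs_usage_percent() {
    df -P "$1" | awk 'NR==2 {gsub("%", "", $5); print $5}'
}

# Print a three-line report for a mount point and a log file.
report() {
    echo "usage_percent: $(fs_usage_percent "$1")"
    echo "no_space_errors: $(count_matches 'No space left on device' "$2")"
    echo "rsync_io_errors: $(count_matches 'rsync error: error in file IO' "$2")"
}
```

For example, after sourcing the file, run report /srv/node/sdb "$SWIFT_LOG" on the affected controller. A high usage percentage together with nonzero error counts is consistent with the issue described in this article.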
