Kernel selects wrong block I/O scheduler for Controller and Compute node disks


Issue

  • After updating the overcloud to Red Hat OpenStack 16.2, we observed severe storage performance degradation on Controller and Compute nodes, which manifested in several different and sometimes catastrophic ways:

    • running sos report --all-logs sometimes hangs the root filesystem long enough for Pacemaker to mark the node as unresponsive, triggering a fencing event.

    • qemu-img convert tasks performed by cinder-volume slow down so much that Nova frequently times out while waiting for boot volumes to become available, unless the converted image is already present in the cinder volume-image cache.

    • RabbitMQ startup after a controller reboot takes an extremely long time, and Pacemaker sometimes declares the resource FAILED.

    • Galera SSTs (State Snapshot Transfers) and ISTs (Incremental State Transfers) likewise take an extremely long time (tens of minutes), causing Pacemaker to declare the receiving node FAILED.

    • Sometimes even the donor Galera node is declared FAILED, causing Pacemaker to abruptly terminate the entire Galera cluster. The only way to bring it back up is then a manual recovery: examining grastate.dat on all nodes to find the one with the highest sequence number and marking it as the bootstrap node.
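The manual Galera recovery mentioned above can be sketched as follows. This is a minimal illustration, not a procedure from this article: the grastate.dat content and path are sample values (here written to a temp directory so the sketch is self-contained; on a real node the file lives under the MariaDB data directory).

```shell
# Sketch: pick the Galera bootstrap node by comparing seqno values
# from grastate.dat. Sample file created here for illustration only.
tmp=$(mktemp -d)
cat > "$tmp/grastate.dat" <<'EOF'
# GALERA saved state
version: 2.1
uuid:    8bcf4a34-aedb-14e5-bcc3-d3e36277729f
seqno:   31245
safe_to_bootstrap: 0
EOF

# Read the recovered sequence number on each node; the node with the
# HIGHEST seqno across the cluster is the bootstrap candidate.
awk '/^seqno:/ {print $2}' "$tmp/grastate.dat"

# On that node only, mark the state file safe to bootstrap, then start
# Galera there first and rejoin the remaining nodes afterwards.
sed -i 's/^safe_to_bootstrap:.*/safe_to_bootstrap: 1/' "$tmp/grastate.dat"
```

A seqno of -1 in grastate.dat indicates an unclean shutdown; in that case the sequence number has to be recovered separately before the comparison can be made.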

Environment

  • Red Hat OpenStack 16.2.
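Since the issue is the kernel selecting the wrong block I/O scheduler, a quick way to confirm which scheduler is active on each disk is to read the standard sysfs attribute. This is a generic diagnostic sketch, not a resolution from this article; device names vary per node, and the runtime change shown in the comment does not persist across reboots.

```shell
# Show the available I/O schedulers for every block device; the active
# scheduler is the one printed in square brackets, e.g. [mq-deadline].
for dev in /sys/block/*/queue/scheduler; do
    printf '%s: %s\n' "$dev" "$(cat "$dev")"
done

# Example (hypothetical device sda): switch the scheduler at runtime.
# echo mq-deadline > /sys/block/sda/queue/scheduler
```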
