Kernel selects wrong block I/O scheduler for Controller and Compute node disks
Issue
- After updating the overcloud to Red Hat OpenStack 16.2, severe storage performance degradation was observed on Controller and Compute nodes, showing up in several different and sometimes catastrophic ways:
  - Running sos report --all-logs sometimes hangs the root filesystem long enough for Pacemaker to mark the node as unresponsive, triggering a fencing event.
  - qemu-img convert tasks performed by cinder-volume have slowed down so much that Nova frequently times out while waiting for boot volumes to become available, unless the converted image is already hot in the cinder volume-image cache.
  - RabbitMQ startup after a controller reboot takes so long that Pacemaker sometimes declares the resource FAILED.
  - Galera SSTs (State Snapshot Transfers) and Galera ISTs (Incremental State Transfers) likewise take tens of minutes, causing Pacemaker to declare the receiver FAILED.
    - Sometimes even the donor Galera node is declared FAILED, causing Pacemaker to abruptly terminate the entire Galera cluster. The only way to bring it back up is then a manual recovery: analyze grastate.dat on all nodes to find the one with the highest sequence number and mark it as the bootstrap node.
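The manual Galera recovery described in the last item comes down to comparing the seqno field of grastate.dat across nodes. The following is a minimal sketch of that comparison, not a supported recovery procedure. It assumes the contents of each node's grastate.dat have already been collected into a dictionary. The node names, UUID, and sequence numbers in SAMPLE_GRASTATE are made-up illustrative values, and the helper names parse_grastate and pick_bootstrap_node are hypothetical.

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the leading '# GALERA saved state' comment.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(grastate_by_node):
    """Return the name of the node whose grastate.dat has the highest seqno.

    grastate_by_node maps a node name to the text of its grastate.dat.
    A seqno of -1 means the node did not record its position (for example
    after an unclean shutdown), so it cannot be compared this way.
    """
    seqnos = {
        node: int(parse_grastate(text).get("seqno", "-1"))
        for node, text in grastate_by_node.items()
    }
    best = max(seqnos, key=seqnos.get)
    if seqnos[best] < 0:
        raise ValueError("no node recorded a valid seqno in grastate.dat")
    return best


# Illustrative input only: hypothetical node names and invented values.
_TEMPLATE = (
    "# GALERA saved state\n"
    "version: 2.1\n"
    "uuid:    00000000-0000-0000-0000-000000000001\n"
    "seqno:   {seqno}\n"
    "safe_to_bootstrap: 0\n"
)
SAMPLE_GRASTATE = {
    "controller-0": _TEMPLATE.format(seqno=1041),
    "controller-1": _TEMPLATE.format(seqno=1047),
    "controller-2": _TEMPLATE.format(seqno=-1),
}

if __name__ == "__main__":
    print(pick_bootstrap_node(SAMPLE_GRASTATE))
```

With the sample data, the sketch selects controller-1, because controller-2 reports -1 and controller-0 has a lower sequence number. If every node reports -1, the function raises an error instead of guessing, since the last committed position then has to be recovered by other means.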
Environment
- Red Hat OpenStack 16.2.
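On a node in this environment, the scheduler the kernel actually selected for each disk can be read from sysfs, where /sys/block/<device>/queue/scheduler lists the available schedulers and marks the active one in square brackets. The sketch below is one way to collect that information. It assumes a Linux host with sysfs mounted at /sys, and the function names parse_scheduler_line and active_schedulers are hypothetical helpers, not part of any Red Hat tool. It only reports what is selected and makes no judgment about which scheduler is the right one.

```python
import re
from pathlib import Path


def parse_scheduler_line(line):
    """Split a sysfs scheduler line into (active, available).

    Example input: 'mq-deadline kyber [bfq] none'. The active scheduler
    is the bracketed entry; active is None if nothing is bracketed.
    """
    match = re.search(r"\[([^\]]+)\]", line)
    active = match.group(1) if match else None
    available = line.replace("[", "").replace("]", "").split()
    return active, available


def active_schedulers(sys_block="/sys/block"):
    """Map each block device name to its active I/O scheduler."""
    result = {}
    for path in sorted(Path(sys_block).glob("*/queue/scheduler")):
        active, _ = parse_scheduler_line(path.read_text().strip())
        # path is <sys_block>/<device>/queue/scheduler
        result[path.parent.parent.name] = active
    return result


if __name__ == "__main__":
    for device, scheduler in active_schedulers().items():
        print(f"{device}: {scheduler}")
```

Running this on each Controller and Compute node gives a quick per-disk view that can be compared before and after the update.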