Chapter 5. Tuning the RHEL-OSP 7 Environment using Browbeat
Once the initial baseline benchmarking results using Rally are captured, the focus shifts to tuning the RHEL-OSP 7 environment. Because of the complexities involved in tuning a RHEL-OSP environment, the Red Hat Performance team has created an open source project named Browbeat. Browbeat is a set of scripts and Ansible playbooks that help determine different performance characteristics of RHEL-OSP. One of its key features is a collection of Ansible playbooks that perform overcloud checks across the environment. This reference environment uses the overcloud checks to determine whether the different OpenStack services are running optimally.
Browbeat is not supported and is provided as a helper tool to identify common performance issues.
5.1. Installation of Browbeat
Within the rally VM, as the stack user:

1. If not already installed, install git via the yum command:
$ yum install git
2. Install the ansible package via yum:
$ yum install ansible
3. Change into a working directory and git clone the browbeat repository:
$ mkdir /path/to/mybrowbeat
$ cd /path/to/mybrowbeat
$ git clone https://github.com/jtaleric/browbeat.git
4. Install the public key into the stack user's authorized_keys file:
$ ssh-copy-id stack@<undercloud-ip>
5. Run the gen_hostfile.sh script to generate an overcloud hosts file for Ansible and a jumpbox SSH configuration file. This example names the configuration file env_machines; it can reside anywhere within the Rally VM.
$ ./gen_hostfile.sh <undercloud-ip> /path/to/env_machines
6. Export the env_machines file as an Ansible ssh argument:
$ export ANSIBLE_SSH_ARGS='-F /path/to/env_machines'
7. Run the check playbook to identify common performance issues:
$ ansible-playbook -i hosts check/site.yml
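If the playbook cannot reach one or more hosts, a quick Ansible connectivity check against the generated inventory (named hosts here, as in the playbook run above) can confirm the SSH configuration is working before debugging further:

$ ansible -i hosts all -m ping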
5.2. Analyzing Performance Issues generated by the Ansible Playbook
Browbeat’s check playbook verifies whether certain settings are properly configured within the RHEL-OSP 7 environment. The set of verifications will continue to grow as the project develops; the verifications available at the time of this writing are described below.
Browbeat’s verification checks are separated by roles. The roles are broken up as follows:
- common - verification checks common to RHEL-OSP 7 Director, Controller, Compute, and Ceph nodes
- controller - verification checks specific to the RHEL-OSP 7 Controller nodes
- compute - verification checks specific to the RHEL-OSP 7 Compute nodes
- ceph - verification checks specific to the RHEL-OSP 7 Ceph nodes
- keystone - verification checks specific to the RHEL-OSP 7 Keystone service
A further breakdown of the verification checks for each role consists of the following:
Browbeat’s check playbook for RHEL-OSP 7 common nodes:
- Verify SELinux status for each host
- Verify the tuned daemon is running on each host
- Verify the current running tuned profile for each host
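These checks can be reproduced by hand on any node to confirm what the playbook reports:

$ getenforce                  # SELinux status
$ systemctl is-active tuned   # whether the tuned daemon is running
$ tuned-adm active            # the currently applied tuned profile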
Browbeat’s check playbook for RHEL-OSP 7 Keystone service:
- Verify a cron job exists that removes expired tokens from the MariaDB database, as an accumulation of expired tokens can cause performance degradation
- Check the keystone token provider, which by default is UUID
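A manual spot check of both items on a controller might look like the following; the keystone cron user and configuration path shown are the RHEL-OSP 7 defaults:

$ crontab -l -u keystone                      # look for a keystone-manage token_flush entry
$ grep provider /etc/keystone/keystone.conf   # the [token] provider setting, UUID by default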
Browbeat’s check playbook for RHEL-OSP 7 controller nodes:
- Verify the maximum allowed connections for the MariaDB database is sufficient
- Verify the number of file descriptors assigned to RabbitMQ is set to 16384
- Verify nova and neutron are communicating about VIF plugging
- Verify HAProxy default max connections
- Verify netns tuning
- Verify whether RabbitMQ contains partitions
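Several of these controller settings can also be inspected by hand; a sketch, assuming the default RHEL-OSP 7 configuration file locations:

$ mysql -e "SHOW VARIABLES LIKE 'max_connections';"              # MariaDB connection limit
$ rabbitmqctl status | grep -A 3 file_descriptors                # RabbitMQ file descriptor limits
$ grep -E 'vif_plugging_(is_fatal|timeout)' /etc/nova/nova.conf  # nova/neutron VIF plugging settings
$ grep notify_nova_on_port_status_changes /etc/neutron/neutron.conf
$ grep maxconn /etc/haproxy/haproxy.cfg                          # HAProxy default max connections
$ rabbitmqctl cluster_status                                     # partitions are listed in the output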
Verification checks for RHEL-OSP 7 Compute and Ceph nodes are currently limited. However, with the rapid development of this project, this should change in the near future.
An example output of the overcloud checks from Browbeat.
# Browbeat generated bug report
---------------------------------------
| Issues for host : overcloud-controller-0
---------------------------------------
Bug: bz1095811
Name: Network connectivity issues after 1000 netns
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1095811
Bug: nova_vif_timeout_result
Name: Nova VIF timeout should be >= 300
URL: none
Bug: bz1264740
Name: RHEL OSP Director must be configure with nova-event-callback by default
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1264740
---------------------------------------
| Issues for host : overcloud-controller-1
---------------------------------------
Bug: bz1095811
Name: Network connectivity issues after 1000 netns
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1095811
Bug: nova_vif_timeout_result
Name: Nova VIF timeout should be >= 300
URL: none
Bug: bz1264740
Name: RHEL OSP Director must be configure with nova-event-callback by default
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1264740
---------------------------------------
| Issues for host : overcloud-controller-2
---------------------------------------
Bug: bz1095811
Name: Network connectivity issues after 1000 netns
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1095811
Bug: nova_vif_timeout_result
Name: Nova VIF timeout should be >= 300
URL: none
Bug: bz1264740
Name: RHEL OSP Director must be configure with nova-event-callback by default
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1264740
---------------------------------------
| Issues for host : overcloud-novacompute-0
---------------------------------------
Bug: bz1282644
Name: increase reserved_host_memory_mb
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1282644
Bug: bz1264740
Name: RHEL OSP Director must be configure with nova-event-callback by default
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1264740
Bug: tuned_profile_result
Name: Ensure Tuned Profile is set to virtual-host
URL: none
Bug: bz1245714
Name: No Swap Space allocated
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1245714
Bug: nova_vif_timeout_result
Name: Nova VIF timeout should be >= 300
URL: none
---------------------------------------
| Issues for host : overcloud-novacompute-1
---------------------------------------
Bug: bz1282644
Name: increase reserved_host_memory_mb
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1282644
Bug: bz1264740
Name: RHEL OSP Director must be configure with nova-event-callback by default
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1264740
Bug: tuned_profile_result
Name: Ensure Tuned Profile is set to virtual-host
URL: none
Bug: bz1245714
Name: No Swap Space allocated
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1245714
Bug: nova_vif_timeout_result
Name: Nova VIF timeout should be >= 300
URL: none
With the output generated by Browbeat, users can tune their existing RHEL-OSP environment to address reported issues, which may ultimately provide better performance and scalability. For demonstration purposes, this reference environment takes a closer look at the following bugs reported by Browbeat:
---------------------------------------
| Issues for host : overcloud-novacompute-1
---------------------------------------
Bug: bz1282644
Name: increase reserved_host_memory_mb
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1282644

AND

Bug: tuned_profile_result
Name: Ensure Tuned Profile is set to virtual-host
URL: none
The first bug, BZ#1282644, references the reserved_host_memory_mb parameter found within each compute node’s /etc/nova/nova.conf file. The reserved_host_memory_mb parameter ensures that the specified number of megabytes is reserved for the hypervisor and cannot be used as additional RAM for launching guest instances. The default value of reserved_host_memory_mb is currently 512 MB of RAM. However, as guest instances are launched and the hypervisor reaches its 512 MB threshold, a potential out-of-memory (OOM) situation can arise where guest instances and/or processes are killed within the RHEL-OSP environment, because 512 MB of RAM is not enough memory for the hypervisor to run the environment. To alleviate potential OOM scenarios, it is recommended to give each compute node within the RHEL-OSP 7 environment at least 2048 MB of reserved host memory.
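A minimal sketch of applying this recommendation on a compute node, assuming the crudini utility is available (it normally is on RHEL-OSP 7 nodes) and that restarting the nova-compute service is acceptable:

$ sudo crudini --set /etc/nova/nova.conf DEFAULT reserved_host_memory_mb 2048
$ sudo crudini --get /etc/nova/nova.conf DEFAULT reserved_host_memory_mb    # confirm the new value
$ sudo systemctl restart openstack-nova-compute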
Besides increasing the reserved host memory for the hypervisor, how does the reserved_host_memory_mb parameter affect performance and/or scalability?
By increasing the amount of RAM the hypervisor reserves, it limits the number of launchable guest instances. This can be clearly seen when running the nova hypervisor-stats command.
$ nova hypervisor-stats
+----------------------+-------+
| Property             | Value |
+----------------------+-------+
| count                | 1     |
| current_workload     | 0     |
| disk_available_least | 36949 |
| free_disk_gb         | 37001 |
| free_ram_mb          | 93912 |
| local_gb             | 37021 |
| local_gb_used        | 20    |
| memory_mb            | 96472 |
| memory_mb_used       | 2560  |
| running_vms          | 1     |
| vcpus                | 32    |
| vcpus_used           | 1     |
+----------------------+-------+
In the example above, memory_mb_used is 2560 MB: the one virtual machine currently running consumes 2048 MB, and the remaining 512 MB is the memory reserved by the hypervisor, for a total of 2560 MB. To calculate the number of launchable guests:

((total_ram_mb - reserved_host_memory_mb) * memory_overcommit) / RAM of each instance

In this scenario, ((96472 - 512) * 1) / 2048 = 46 guests
Default memory overcommit is 1:1 within RHEL-OSP 7
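For reference, the same calculation can be reproduced with shell integer arithmetic using the values from the hypervisor-stats output above (fractional guests are dropped):

$ echo $(( (96472 - 512) * 1 / 2048 ))
46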
Once the reserved memory for the hypervisor is increased from 512 MB to 2048 MB, nova hypervisor-stats reveals:
$ nova hypervisor-stats
+----------------------+-------+
| Property             | Value |
+----------------------+-------+
| count                | 1     |
| current_workload     | 0     |
| disk_available_least | 36949 |
| free_disk_gb         | 37001 |
| free_ram_mb          | 92376 |
| local_gb             | 37021 |
| local_gb_used        | 20    |
| memory_mb            | 96472 |
| memory_mb_used       | 4096  |
| running_vms          | 1     |
| vcpus                | 32    |
| vcpus_used           | 1     |
+----------------------+-------+
In this scenario, total memory used is 4096 MB, which includes the 2048 MB reserved for the hypervisor and the 2048 MB consumed by the one virtual machine currently running. Calculating the number of launchable guests using the same formula as above: ((96472 - 2048) * 1) / 2048 = 45 guests
While the impact in this scenario is small (losing one virtual machine when using only one compute node), as compute nodes are added and different flavors are used for guest instances, it can have a larger overall impact on the number of instances that can be launched.
The last issue to be addressed in this example is tuned_profile_result. The tuned package in Red Hat Enterprise Linux 7 is recommended for automatically tuning the system for common workloads through the use of profiles. Each profile is tailored for a different workload scenario, such as throughput-performance, virtual-host, and virtual-guest.
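To see which profiles are available on a node and which profile tuned considers appropriate for the detected hardware, the tuned-adm utility can be queried directly:

$ tuned-adm list        # list the profiles installed on the system
$ tuned-adm recommend   # the profile tuned recommends for this system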
In Chapter 4, Analyzing RHEL-OSP 7 Benchmark Results with Rally, we discussed how the RHEL-OSP environment performed and scaled with and without the default settings, prior to tuning with Browbeat. To summarize: in the boot-storm tests, as the concurrency value approached the maximum number of guest instances the environment could launch, CPU wait times rose sharply because the Ceph nodes could not keep up with the number of guest instances being booted simultaneously. When targeting maximum guests with a low concurrency, the RHEL-OSP environment was able to reach the maximum guest count; however, at times the average and maximum boot times were higher than expected.
When running the tests in Chapter 4, Analyzing RHEL-OSP 7 Benchmark Results with Rally, all the Controller, Compute, and Ceph nodes ran with the throughput-performance tuned profile. Below are the example results when running 2 compute nodes with a concurrency of 8, 91 times.
2-compute-node-91times-8concurrency-vif-plugging-timeout-0
+---------------------------------------------------------------------------------------------+
|                                     Response Times (sec)                                     |
+-------------------+--------+--------+--------+---------+---------+--------+---------+-------+
| action            | min    | median | 90%ile | 95%ile  | max     | avg    | success | count |
+-------------------+--------+--------+--------+---------+---------+--------+---------+-------+
| nova.boot_server  | 24.637 | 32.123 | 45.629 | 117.858 | 152.226 | 41.814 | 100.0%  | 91    |
| nova.list_servers | 0.412  | 1.23   | 1.741  | 1.84    | 2.632   | 1.2    | 100.0%  | 91    |
| total             | 26.317 | 33.458 | 46.451 | 118.356 | 152.743 | 43.015 | 100.0%  | 91    |
+-------------------+--------+--------+--------+---------+---------+--------+---------+-------+
Load duration: 500.732126951
Full duration: 652.159901142
While a very high success rate is achieved, the question remains: is performance and scalability impacted if we implement Browbeat’s recommendation? While Browbeat’s bug report only mentions changing the tuned profile for the compute nodes, further research found within this Red Hat Performance Brief: http://bit.ly/20vAp0c shows that the following tuned recommendations can potentially increase performance. They include:
- Controller nodes - throughput-performance profile
- Compute nodes - virtual-host profile
- Ceph nodes - virtual-host profile
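One way to roll these profile changes out across the overcloud is with Ansible ad-hoc commands against the same inventory used by the check playbook. A sketch, assuming the generated inventory defines controller, compute, and ceph groups (adjust the group names to match your hosts file) and that the connecting user can escalate privileges with become:

$ ansible -i hosts controller -b -m shell -a "tuned-adm profile throughput-performance"
$ ansible -i hosts compute -b -m shell -a "tuned-adm profile virtual-host"
$ ansible -i hosts ceph -b -m shell -a "tuned-adm profile virtual-host"
$ ansible -i hosts all -b -m shell -a "tuned-adm active"    # verify the active profile on every node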
When re-running the maximum guest launch test with the above changes on each node, keeping the concurrency of 8, 91 times, and 2 compute nodes, the results are:
2-compute-node-91times-8concurrency-vif-plugging-timeout-0-tuned-profile-changes
+-------------------------------------------------------------------------------------------+
|                                    Response Times (sec)                                    |
+-------------------+--------+--------+--------+--------+--------+--------+---------+-------+
| action            | min    | median | 90%ile | 95%ile | max    | avg    | success | count |
+-------------------+--------+--------+--------+--------+--------+--------+---------+-------+
| nova.boot_server  | 24.791 | 31.464 | 34.104 | 35.382 | 36.264 | 31.352 | 100.0%  | 91    |
| nova.list_servers | 0.38   | 1.223  | 1.867  | 1.925  | 2.355  | 1.227  | 100.0%  | 91    |
| total             | 27.147 | 32.774 | 34.858 | 36.306 | 37.318 | 32.579 | 100.0%  | 91    |
+-------------------+--------+--------+--------+--------+--------+--------+---------+-------+
Load duration: 388.272264957
Full duration: 504.665802002
Referencing just the minimum and average values, the minimum boot time increased by less than 1%, but the key result is the average boot time decreasing by 25%.
The original boot-storm test with the same two-compute-node configuration, with 91 concurrency and 91 times, reported:
2-compute-node-91times-91concurrency-vif-plugging-timeout-0
+------------------------------------------------------------------------------------------------+
| Response Times (sec) |
+-------------------+--------+---------+---------+---------+---------+---------+---------+-------+
| action | min | median | 90%ile | 95%ile | max | avg | success | count |
+-------------------+--------+---------+---------+---------+---------+---------+---------+-------+
| nova.boot_server | 43.502 | 171.631 | 295.4 | 309.489 | 314.713 | 168.553 | 100.0% | 91 |
| nova.list_servers | 1.355 | 1.961 | 2.522 | 2.565 | 3.024 | 2.031 | 94.5% | 91 |
| total | 45.928 | 167.625 | 274.446 | 291.547 | 307.154 | 162.119 | 94.5% | 91 |
+-------------------+--------+---------+---------+---------+---------+---------+---------+-------+
Load duration: 315.15325284
Full duration: 499.308477163
HINTS:
* To plot HTML graphics with this data, run:
rally task report 04d9e8aa-da94-4724-a904-37abca471543 --out output.html
* To generate a JUnit report, run:
rally task report 04d9e8aa-da94-4724-a904-37abca471543 --junit --out output.xml
* To get raw JSON output of task results, run:
rally task results 04d9e8aa-da94-4724-a904-37abca471543
When re-running the same environment but with the tuned profile changes across all the nodes, the results show:
2-compute-node-91times-91concurrency-vif-plugging-timeout-0-tuned-profile-changes
+------------------------------------------------------------------------------------------------+
|                                      Response Times (sec)                                       |
+-------------------+--------+---------+---------+---------+---------+---------+---------+-------+
| action            | min    | median  | 90%ile  | 95%ile  | max     | avg     | success | count |
+-------------------+--------+---------+---------+---------+---------+---------+---------+-------+
| nova.boot_server  | 43.39  | 152.542 | 259.255 | 270.847 | 275.094 | 166.24  | 100.0%  | 91    |
| nova.list_servers | 1.525  | 1.967   | 2.436   | 2.582   | 3.285   | 2.025   | 100.0%  | 91    |
| total             | 45.294 | 154.223 | 261.232 | 272.594 | 276.72  | 168.266 | 100.0%  | 91    |
+-------------------+--------+---------+---------+---------+---------+---------+---------+-------+
Load duration: 276.911462784
Full duration: 398.153054953
Referencing just the minimum and average values, the minimum boot time decreased by less than 1%, but the key result is the average boot time decreasing by 13.7%. While the decrease in the average boot time is not as drastic as in the previous example with a lower concurrency rate, this is due to the Ceph storage being unable to keep up with the workload. However, it is important to mention that even with the Ceph storage being the bottleneck, an increase in performance was obtained by changing the tuned profiles. As these results show, implementing the different tuning recommendations from Browbeat can have a great impact on overall performance and scalability.
