Chapter 6. Resource Isolation and Tuning

This chapter is similar to Chapter 5, Define the Overcloud, in that it should result in changes made to the Heat environment files in the ~/custom-templates directory. However, it differs in that the changes are made not to define the overcloud but to tune it in order to improve performance and isolate resources.

Isolating resources is important in a hyper-converged deployment because contention between Ceph and OpenStack could result in degradation of either service, and neither service is aware of the other’s presence on the same physical host.

6.1. Nova Reserved Memory and CPU Allocation Ratio

This section explains the reasoning behind tuning the Nova settings reserved_host_memory_mb and cpu_allocation_ratio. A Python program is provided which takes properties of the hardware and planned workload as input and recommends values for reserved_host_memory_mb and cpu_allocation_ratio. The recommended settings favor the stability of a hyper-converged deployment over maximizing the number of possible guests. Red Hat recommends starting with these defaults and testing specific workloads targeted at the OpenStack deployment. If necessary, these settings may be changed to find the desired trade-off between determinism and guest-hosting capacity. The end of this section covers how to deploy the settings using Red Hat OpenStack Platform director.

6.1.1. Nova Reserved Memory

Nova’s reserved_host_memory_mb is the amount of memory in MB to reserve for the host. If a node is dedicated only to offering compute services, then this value should be set to maximize the number of running guests. However, on a system that must also support Ceph OSDs, this value needs to be increased so that Ceph has access to the memory that it needs.

To determine the reserved_host_memory_mb for a hyper-converged node, assume that each OSD consumes 3GB of RAM. Given a node with 256GB of RAM and 10 OSDs, 30GB of RAM is used for Ceph and 226GB of RAM is available for Nova. If the average guest uses the m1.small flavor, which uses 2GB of RAM per guest, then the overall system could host 113 such guests. However, there is additional per-guest overhead for the hypervisor to account for. Assume this overhead is 0.5GB. With this overhead taken into account, the maximum number of 2GB guests that could be run is 226GB divided by 2.5GB of RAM, which is approximately 90 virtual guests.

Given this number of guests and the number of OSDs, the amount of memory to reserve so that Nova cannot use it is the number of guests times their overhead plus the number of OSDs times the amount of RAM that each OSD should have. In other words, (90*0.5) + (10*3), which is 75GB. Nova expects this value in MB, and thus 75000 would be provided in nova.conf.

These ideas may be expressed mathematically in the following Python code:

left_over_mem = mem - (GB_per_OSD * osds)
number_of_guests = int(left_over_mem /
                       (average_guest_size + GB_overhead_per_guest))
nova_reserved_mem_MB = MB_per_GB * (
                        (GB_per_OSD * osds) +
                        (number_of_guests * GB_overhead_per_guest))

The above is from the Nova Memory and CPU Calculator, which is covered in Section 6.1.3, “Nova Memory and CPU Calculator”.
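
For readers who want to run the fragment on its own, the following is a minimal self-contained sketch that wraps it in a function. The function name reserved_memory_mb and its keyword defaults are illustrative and are not taken from nova_mem_cpu_calc.py; the 3GB per OSD, the 0.5GB per guest, and the 1000 MB per GB conversion follow the worked example above.

```python
MB_PER_GB = 1000  # matches the 75GB -> 75000 MB conversion used in the text


def reserved_memory_mb(mem_gb, osds, average_guest_size_gb,
                       gb_per_osd=3, gb_overhead_per_guest=0.5):
    """Return (number_of_guests, reserved_host_memory_mb) for one node."""
    # Memory left for guests once every OSD has its share.
    left_over_mem = mem_gb - (gb_per_osd * osds)
    # Each guest costs its flavor size plus the hypervisor overhead.
    number_of_guests = int(left_over_mem /
                           (average_guest_size_gb + gb_overhead_per_guest))
    # Reserve what Ceph needs plus the per-guest overhead.
    reserved_mb = MB_PER_GB * ((gb_per_osd * osds) +
                               (number_of_guests * gb_overhead_per_guest))
    return number_of_guests, int(reserved_mb)


if __name__ == "__main__":
    guests, reserved = reserved_memory_mb(256, 10, 2)
    print(guests, reserved)  # 90 75000
```

With the example node (256GB of RAM, 10 OSDs, 2GB guests) this reproduces the 90 guests and 75000 MB derived above.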

6.1.2. Nova CPU Allocation Ratio

Nova’s cpu_allocation_ratio is used by the Nova scheduler when choosing compute nodes to run guests. If the ratio has the default of 16:1 and the number of cores on a node, also known as vCPUs, is 56, then the Nova scheduler may schedule enough guests to consume 896 vCPUs before it considers the node unable to handle any more guests. Because the Nova scheduler does not take into account the CPU needs of Ceph OSD services running on the same node, the cpu_allocation_ratio should be modified so that Ceph has the CPU resources it needs to operate effectively without those CPU resources being given to Nova.

To determine the cpu_allocation_ratio for a hyper-converged node, assume that at least one core is used by each OSD (unless the workload is IO intensive). Given a node with 56 cores and 10 OSDs, that leaves 46 cores for Nova. If each guest uses 100% of the CPU that it is given, then the ratio should be the number of guest vCPUs divided by the number of cores; that is, 46 divided by 56, or 0.8. However, because guests don’t usually consume 100% of their CPUs, the ratio should be raised by taking the anticipated percentage into account when determining the number of required guest vCPUs. So, if only 10%, or 0.1, of a vCPU is used by a guest, then the number of vCPUs for guests is 46 divided by 0.1, or 460. When this value is divided by the number of cores, 56, the ratio increases to approximately 8.

These ideas may be expressed mathematically in the following Python code:

 cores_per_OSD = 1.0
 average_guest_util = 0.1 # 10%
 nonceph_cores = cores - (cores_per_OSD * osds)
 guest_vCPUs = nonceph_cores / average_guest_util
 cpu_allocation_ratio = guest_vCPUs / cores

The above is from the Nova Memory and CPU Calculator covered in the next section.
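
As a runnable variant of the fragment, here is a minimal sketch that supplies the inputs the fragment leaves undefined. The function name and the range check on the utilization are assumptions added for illustration, not part of the calculator.

```python
def cpu_allocation_ratio(cores, osds, average_guest_util, cores_per_osd=1.0):
    """Return (guest_vcpus, cpu_allocation_ratio) for one node."""
    # Guard added for this sketch: a utilization of 0 would divide by zero.
    if not 0.0 < average_guest_util <= 1.0:
        raise ValueError("average_guest_util must be in (0.0, 1.0]")
    # Cores that remain once each OSD has its own core.
    nonceph_cores = cores - (cores_per_osd * osds)
    # Lower utilization per guest allows more guest vCPUs.
    guest_vcpus = nonceph_cores / average_guest_util
    return guest_vcpus, guest_vcpus / cores


if __name__ == "__main__":
    for util in (1.0, 0.1):
        vcpus, ratio = cpu_allocation_ratio(56, 10, util)
        print(f"util={util:.0%} guest vCPUs={vcpus:.0f} ratio={ratio:.6f}")
```

For the 56-core, 10-OSD example, the two calls correspond to the 0.8 and approximately 8 ratios worked out above.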

6.1.3. Nova Memory and CPU Calculator

The formulas covered above are in a script called nova_mem_cpu_calc.py, which is available in Appendix: Nova Memory and CPU Calculator. It takes the following ordered parameters as input:

  1. Total host RAM in GB
  2. Total host cores
  3. Ceph OSDs per server
  4. Average guest size in GB
  5. Average guest CPU utilization (0.0 to 1.0)

It prints as output a recommendation for how to set the nova.conf reserved_host_memory_mb and cpu_allocation_ratio to favor stability of a hyper-converged deployment. When the numbers from the example discussed in the previous sections are provided to the script, it returns the following results.

$ ./nova_mem_cpu_calc.py 256 56 10 2 1.0
Inputs:
- Total host RAM in GB: 256
- Total host cores: 56
- Ceph OSDs per host: 10
- Average guest memory size in GB: 2
- Average guest CPU utilization: 100%

Results:
- number of guests allowed based on memory = 90
- number of guest vCPUs allowed = 46
- nova.conf reserved_host_memory = 75000 MB
- nova.conf cpu_allocation_ratio = 0.821429

Compare "guest vCPUs allowed" to "guests allowed based on memory" for actual guest count
$

The number of possible guests is bound by the limitations of either the CPU or the memory of the overcloud. In the example above, if each guest is using 100% of its CPU and there are only 46 vCPUs available, then it is not possible to launch 90 guests, even though there is enough memory to do so. If the anticipated guest CPU utilization decreases to only 10%, then the number of allowable vCPUs increases along with the cpu_allocation_ratio.

$ ./nova_mem_cpu_calc.py 256 56 10 2 0.1
Inputs:
- Total host RAM in GB: 256
- Total host cores: 56
- Ceph OSDs per host: 10
- Average guest memory size in GB: 2
- Average guest CPU utilization: 10%

Results:
- number of guests allowed based on memory = 90
- number of guest vCPUs allowed = 460
- nova.conf reserved_host_memory = 75000 MB
- nova.conf cpu_allocation_ratio = 8.214286

Compare "guest vCPUs allowed" to "guests allowed based on memory" for actual guest count
$
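
The comparison requested by the final line of the script output could be sketched as follows. The helper name max_guests and the vcpus_per_guest parameter are assumptions introduced for illustration; the calculator itself does not take a per-guest vCPU count.

```python
def max_guests(guests_by_memory, guest_vcpus_allowed, vcpus_per_guest=1):
    """Return the guest count bounded by both memory and CPU."""
    # How many guests fit into the vCPU budget (hypothetical per-guest size).
    guests_by_cpu = int(guest_vcpus_allowed // vcpus_per_guest)
    # The tighter of the two limits is the effective guest count.
    return min(guests_by_memory, guests_by_cpu)


if __name__ == "__main__":
    print(max_guests(90, 46))   # 100% utilization: CPU is the limit
    print(max_guests(90, 460))  # 10% utilization: memory is the limit
```

With one vCPU per guest, the first example run is limited to 46 guests by CPU, while the second is limited to 90 guests by memory.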

After determining the desired values of the reserved_host_memory_mb and cpu_allocation_ratio, proceed to the next section to apply the new settings.

6.1.4. Change Nova Reserved Memory and CPU Allocation Ratio with Heat

Create the new file ~/custom-templates/compute.yaml containing the following:

parameter_defaults:
  ExtraConfig:
    nova::compute::reserved_host_memory: 75000
    nova::cpu_allocation_ratio: 8.2

In the above example, ExtraConfig is used to change the amount of memory that the Nova compute service reserves in order to protect both the Ceph OSD service and the host itself. ExtraConfig is also used to change the CPU allocation ratio used by the Nova scheduler service so that it does not allocate any of the CPUs that the Ceph OSD service uses.
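
If the values come from the calculator, the file could be generated rather than typed by hand. The following is a minimal sketch; the function name render_compute_yaml is hypothetical, and rounding the ratio to one decimal place simply mirrors the 8.2 used in the example.

```python
def render_compute_yaml(reserved_host_memory_mb, cpu_allocation_ratio):
    """Return the text of a compute.yaml like the one shown above."""
    memory = int(reserved_host_memory_mb)
    ratio = round(cpu_allocation_ratio, 1)  # assumption: one decimal place
    return (
        "parameter_defaults:\n"
        "  ExtraConfig:\n"
        f"    nova::compute::reserved_host_memory: {memory}\n"
        f"    nova::cpu_allocation_ratio: {ratio}\n"
    )


if __name__ == "__main__":
    # Values taken from the second calculator run in Section 6.1.3.
    print(render_compute_yaml(75000, 8.214286), end="")
```

The returned text could then be written to ~/custom-templates/compute.yaml.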

Tip

Red Hat OpenStack Platform director refers to the reserved_host_memory_mb variable used by Nova as reserved_host_memory.

To verify that the reserved host memory and CPU allocation ratio configuration changes were applied after Chapter 7, Deployment, ssh into any of the OsdCompute nodes and look for the configuration change in the nova.conf.

[root@overcloud-osd-compute-0 ~]# grep reserved_host_memory /etc/nova/nova.conf
reserved_host_memory_mb=75000
[root@overcloud-osd-compute-0 ~]#
[root@overcloud-osd-compute-0 ~]# grep cpu_allocation_ratio /etc/nova/nova.conf
cpu_allocation_ratio=8.2
[root@overcloud-osd-compute-0 ~]#
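
The same check could be scripted. The sketch below parses nova.conf text with Python's configparser and assumes that both options are set in the [DEFAULT] section; the function name is hypothetical and the sample text is a hand-written stand-in, not output from a real node.

```python
import configparser

# Hand-written stand-in for the relevant part of /etc/nova/nova.conf.
SAMPLE = """\
[DEFAULT]
reserved_host_memory_mb=75000
cpu_allocation_ratio=8.2
"""


def read_nova_settings(conf_text):
    """Return (reserved_host_memory_mb, cpu_allocation_ratio) from conf text."""
    # interpolation=None keeps any '%' in other options from raising errors;
    # strict=False tolerates options that appear more than once.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    defaults = parser.defaults()
    return (int(defaults["reserved_host_memory_mb"]),
            float(defaults["cpu_allocation_ratio"]))


if __name__ == "__main__":
    print(read_nova_settings(SAMPLE))
```

On a node, the text would come from reading /etc/nova/nova.conf, and the returned pair can be compared with the values placed in compute.yaml.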

6.1.5. Updating the Nova Reserved Memory and CPU Allocation Ratio

The overcloud workload may vary over time, so it is likely that the reserved_host_memory and cpu_allocation_ratio will need to be changed. To do so after Chapter 7, Deployment, simply update the values in compute.yaml and re-run the deployment command covered in Chapter 7, Deployment. More details on overcloud updates are in Section 8.1, “Configuration Updates”.

6.2. Ceph NUMA Pinning

For systems which run both Ceph OSD and Nova Compute services, determinism can be improved by pinning Ceph to one of the two available NUMA nodes in a two-socket x86 server. The socket to which Ceph should be pinned is the one that has the network IRQ and the storage controller. This choice is made because of a Ceph OSD’s heavy use of network IO. The steps below describe how to create a Red Hat OpenStack Platform director post-deploy script so that Ceph OSD daemons are NUMA pinned to a particular CPU socket when they are started.

6.2.1. Update the Post Deploy Script

In Section 5.4, “Ceph Configuration”, a post-deploy-template was added to the resource registry of ceph.yaml. That post-deploy template originally contained only the following:

heat_template_version: 2014-10-16

parameters:
  servers:
    type: json

resources:

  ExtraConfig:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      inputs:
        - name: OSD_NUMA_INTERFACE
      config: |
        #!/usr/bin/env bash
        {
        echo "TODO: pin OSDs to the NUMA node of $OSD_NUMA_INTERFACE"
        } > /root/post_deploy_heat_output.txt 2>&1

  ExtraDeployments:
    type: OS::Heat::SoftwareDeployments
    properties:
      servers: {get_param: servers}
      config: {get_resource: ExtraConfig}
      input_values:
        OSD_NUMA_INTERFACE: 'em2'
      actions: ['CREATE']

The next two subsections will update the above file.

6.2.1.1. Set the Ceph Service Network Interface

The above Heat template has the following parameter:

OSD_NUMA_INTERFACE: 'em2'

Set the above to the name of the network device on which the Ceph services listen. In this reference architecture the device is em2, but the value may be determined for all deployments by either the StorageNetwork variable, or the StorageMgmtNetwork variable that was set in the Section 5.2, “Network Configuration” section. Workloads that are read-heavy benefit from using the StorageNetwork variable, while workloads that are write-heavy benefit from using the StorageMgmtNetwork variable. In this reference architecture both networks are VLANs on the same interface.

Tip

If the Ceph OSD service uses a virtual network interface, like a bond, then use the name of one of the network devices that make up the bond, not the bond name itself. For example, if bond1 uses em2 and em4, then set OSD_NUMA_INTERFACE to either em2 or em4, not bond1. If the OSD_NUMA_INTERFACE variable is set to a bond name, then the NUMA node will not be found and the Ceph OSD service will not be pinned to either NUMA node. This is because the lstopo command does not return virtual devices.
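
One way to find the member devices of a bond is to read the list that the Linux bonding driver exposes under /sys/class/net/<bond>/bonding/slaves. The sketch below is not part of numa-systemd-osd.sh; the function name and the sysfs_net parameter (which exists only to make the function testable) are assumptions.

```python
from pathlib import Path


def physical_interfaces(name, sysfs_net="/sys/class/net"):
    """Return the member devices of a bond, or [name] if it is not a bond."""
    # The bonding driver lists member devices, space separated, in this file.
    slaves_file = Path(sysfs_net) / name / "bonding" / "slaves"
    if slaves_file.is_file():
        return slaves_file.read_text().split()
    # Not a bond: the name itself is the device to use.
    return [name]


if __name__ == "__main__":
    print(physical_interfaces("bond1"))
```

Any one of the returned names is a candidate value for OSD_NUMA_INTERFACE.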

6.2.1.2. Modify the Shell Script

The following section of custom-templates/post-deploy-template.yaml contains a Heat config line and then embeds a shell script:

      config: |
        #!/usr/bin/env bash
        {
        echo "TODO: pin OSDs to the NUMA node of $OSD_NUMA_INTERFACE"
        } > /root/post_deploy_heat_output.txt 2>&1

Update the above so that, rather than embedding a simple shell script, it includes a more complex shell script from a separate file using Heat’s get_file intrinsic function.

      config: {get_file: numa-systemd-osd.sh}

The above change calls the script numa-systemd-osd.sh, which takes the network interface used for Ceph network traffic as an argument, and then uses lstopo to determine that interface’s NUMA node. It then modifies the systemd unit file for the Ceph OSD service so that numactl is used to start the OSD service with a NUMA policy that prefers the NUMA node of the Ceph network’s interface. It then restarts each Ceph OSD daemon sequentially so that the service runs with the new NUMA option.

When numa-systemd-osd.sh is run directly on an osd-compute node (with OSD_NUMA_INTERFACE set within the shell script), its output looks like the following:

[root@overcloud-osd-compute-0 ~]# ./numa-systemd-osd.sh
changed: --set /usr/lib/systemd/system/ceph-osd@.service Service ExecStart '/usr/bin/numactl -N 0 --preferred=0 /usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph'

Status of OSD 1 before unit file update

* ceph-osd@1.service - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-12-16 02:50:02 UTC; 11min ago
 Main PID: 83488 (ceph-osd)
   CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@1.service
           └─83488 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph

Dec 16 02:50:01 overcloud-osd-compute-0.localdomain systemd[1]: Starting Ceph object storage daemon...
Dec 16 02:50:02 overcloud-osd-compute-0.localdomain ceph-osd-prestart.sh[83437]: create-or-move update...
Dec 16 02:50:02 overcloud-osd-compute-0.localdomain systemd[1]: Started Ceph object storage daemon.
Dec 16 02:50:02 overcloud-osd-compute-0.localdomain numactl[83488]: starting osd.1 at :/0 osd_data /v...l
Dec 16 02:50:02 overcloud-osd-compute-0.localdomain numactl[83488]: 2016-12-16 02:50:02.544592 7fecba...}
Dec 16 03:01:19 overcloud-osd-compute-0.localdomain systemd[1]: [/usr/lib/systemd/system/ceph-osd@.s...e'
Hint: Some lines were ellipsized, use -l to show in full.

Restarting OSD 1...

Status of OSD 1 after unit file update

* ceph-osd@1.service - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-12-16 03:01:21 UTC; 7ms ago
  Process: 89472 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 89521 (numactl)
   CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@1.service
           └─89521 /usr/bin/numactl -N 0 --preferred=0 /usr/bin/ceph-osd -f --cluster ceph --id 1 --se...

Dec 16 03:01:21 overcloud-osd-compute-0.localdomain systemd[1]: Starting Ceph object storage daemon...
Dec 16 03:01:21 overcloud-osd-compute-0.localdomain ceph-osd-prestart.sh[89472]: create-or-move update...
Dec 16 03:01:21 overcloud-osd-compute-0.localdomain systemd[1]: Started Ceph object storage daemon.
Hint: Some lines were ellipsized, use -l to show in full.

Status of OSD 11 before unit file update
...

The logs of the node should indicate that numactl was used to start the OSD service.

[root@overcloud-osd-compute-0 ~]# journalctl | grep numa | grep starting
Dec 16 02:50:02 overcloud-osd-compute-0.localdomain numactl[83488]: starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
...

When the modified post-deploy-template.yaml and numa-systemd-osd.sh described above are run in Chapter 7, Deployment, the Ceph OSD daemons on each osd-compute node will be restarted under a NUMA policy. To verify that this has been completed after the deployment, check the log with journalctl as shown above or check the output of numa-systemd-osd.sh captured in /root/post_deploy_heat_output.txt.

The full content of post-deploy-template.yaml and numa-systemd-osd.sh may be read in Appendix D, Custom Heat Templates, and is also available online as described in Appendix G, GitHub Repository of Example Files.

6.2.1.3. Details on OSD systemd unit file update for NUMA

The numa-systemd-osd.sh script checks if the hwloc and numactl packages are installed, and if they are not installed, tries to install them with yum. To ensure these packages are available, consider either of the following options:

  • Configure Red Hat OpenStack Platform director to register the overcloud to a yum repository containing the numactl and hwloc packages by using the --rhel-reg, --reg-method, --reg-org options described in 5.7. Setting Overcloud Parameters of the Red Hat document Director Installation and Usage.
  • Before uploading the overcloud images to the undercloud Glance service, install the numactl, hwloc-libs, and hwloc packages on the overcloud image with virt-customize, as described in 24.12 virt-customize: Customizing Virtual Machine Settings from the Virtualization Deployment and Administration Guide for Red Hat Enterprise Linux 7.

The numactl package is necessary so that the Ceph OSD processes can be started with a NUMA policy. The hwloc package provides the lstopo-no-graphics command, which shows the CPU topology of the system. Rather than require the user to determine which NUMA socket Ceph should be pinned to, based on the IRQ of the $OSD_NUMA_INTERFACE, the following examines the system to determine the desired NUMA socket number. It uses the lstopo-no-graphics command, filters the output with grep and then loops through the output to determine which NUMA socket has the IRQ.

declare -A NUMASOCKET
while read TYPE SOCKET_NUM NIC ; do
    if [[ "$TYPE" == "NUMANode" ]]; then
        NUMASOCKET=$(echo $SOCKET_NUM | sed s/L//g);
    fi
    if [[ "$NIC" == "$OSD_NUMA_INTERFACE" ]]; then
        # because $NIC is the $OSD_NUMA_INTERFACE,
        # the NUMASOCKET has been set correctly above
        break # so stop looking
    fi
done < <(lstopo-no-graphics | tr -d '[:punct:]' | egrep "NUMANode|$OSD_NUMA_INTERFACE")

The tr command is used to trim away punctuation, as lstopo-no-graphics outputs the network interface in quotes. A regular expression passed to egrep shows only the lines containing NUMANode or the $OSD_NUMA_INTERFACE defined earlier. A while loop with read is used to extract the three columns containing the desired strings. Each NUMA socket number is collected without the preceding 'L', which sed removes. The $NUMASOCKET is set on each iteration whose line contains NUMANode in case, during a later iteration, the $OSD_NUMA_INTERFACE is found. When the desired network interface is found, the loop exits with break before the $NUMASOCKET variable can be set to the next NUMA socket number. If no $NUMASOCKET is found, then the script exits.
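
To make the loop's logic easier to follow, the sketch below restates it in Python. It is not a replacement for the script. The function name is hypothetical, and the sample text is hand-written and heavily simplified to the two line shapes the loop cares about; it is not captured lstopo-no-graphics output. Unlike the shell loop, this sketch returns None when the interface is never seen.

```python
import string

# Hand-written, simplified stand-in for lstopo-no-graphics output.
SAMPLE = '''\
Machine
  NUMANode L#0 (P#0)
    Net L#0 "em1"
  NUMANode L#1 (P#1)
    Net L#1 "em2"
'''


def numa_socket_for(lstopo_text, interface):
    """Return the number of the NUMANode listed before `interface`, or None."""
    strip_punct = str.maketrans("", "", string.punctuation)  # like tr -d
    socket = None
    for line in lstopo_text.splitlines():
        # Same filter as the egrep expression.
        if "NUMANode" not in line and interface not in line:
            continue
        # Like `read TYPE SOCKET_NUM NIC`: the third field takes the rest.
        fields = line.translate(strip_punct).split(None, 2)
        if len(fields) < 2:
            continue
        kind, number = fields[0], fields[1]
        nic = fields[2].strip() if len(fields) > 2 else ""
        if kind == "NUMANode":
            socket = number.replace("L", "")  # like sed s/L//g
        if nic == interface:
            return socket  # the most recent NUMANode owns this interface
    return None


if __name__ == "__main__":
    print(numa_socket_for(SAMPLE, "em2"))
```

Because the socket is remembered from the most recent NUMANode line, the interface line only has to signal when to stop.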

The crudini command is used to save the ExecStart section of the default OSD unit file.

CMD=$(crudini --get $UNIT Service ExecStart)

A different crudini command is then used to put the same command back as the ExecStart command, but with a numactl call prepended to it.

crudini --verbose --set $UNIT Service ExecStart "$NUMA $CMD"

The $NUMA variable holds the numactl call that starts the OSD daemon with a NUMA policy so that the command executes only on the CPUs of the $NUMASOCKET identified previously. The --preferred option is used rather than --membind because testing shows that hard pinning with --membind can cause swapping.

NUMA="/usr/bin/numactl -N $NUMASOCKET --preferred=$NUMASOCKET"
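
The string handling behind the two crudini calls can be sketched as follows. The function name is hypothetical, and the guard against prepending numactl a second time is an addition made for this sketch, not something the script is stated to do.

```python
def numa_exec_start(exec_start, numa_socket, numactl="/usr/bin/numactl"):
    """Return exec_start prefixed with a numactl call for numa_socket."""
    # Assumption for this sketch: leave an already wrapped command alone.
    if exec_start.startswith(numactl + " "):
        return exec_start
    # -N restricts execution to the node's CPUs; --preferred is a soft
    # memory policy, in contrast to the hard pinning of --membind.
    prefix = f"{numactl} -N {numa_socket} --preferred={numa_socket}"
    return f"{prefix} {exec_start}"


if __name__ == "__main__":
    original = ("/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i "
                "--setuser ceph --setgroup ceph")
    print(numa_exec_start(original, 0))
```

Given the default ExecStart of the OSD unit file and socket 0, the result matches the ExecStart value shown in the script output earlier in this section.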

The last thing that numa-systemd-osd.sh does is to restart all of the OSD daemons on the server.

OSD_IDS=$(ls /var/lib/ceph/osd | awk 'BEGIN { FS = "-" } ; { print $2 }')
for OSD_ID in $OSD_IDS; do
  systemctl restart ceph-osd@$OSD_ID
done

A variation of the command above is used to show the status before and after the restart. This status is saved in /root/post_deploy_heat_output.txt on each osd-compute node.
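
For comparison, the ID extraction and sequential restart could be expressed in Python as in the sketch below. The function names are hypothetical, and the commands are only built here, not executed.

```python
import os


def osd_ids(osd_root="/var/lib/ceph/osd"):
    """Return OSD ids parsed from directory names such as 'ceph-1'."""
    ids = []
    for entry in sorted(os.listdir(osd_root)):
        parts = entry.split("-")  # same field separator as the awk call
        if len(parts) > 1:
            ids.append(parts[1])
    return ids


def restart_commands(ids):
    """Return one systemctl restart command per OSD id, in order."""
    return [["systemctl", "restart", f"ceph-osd@{osd_id}"] for osd_id in ids]


if __name__ == "__main__":
    for command in restart_commands(osd_ids()):
        print(" ".join(command))
```

Running each command in turn, for example with subprocess.run(command, check=True), would restart the daemons one at a time as the shell loop does.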

Warning

Each time the above script is run, the OSD daemons are restarted sequentially on all Ceph OSD nodes. Thus, this script is only run on create, not on update, as per the actions: ['CREATE'] line in post-deploy-template.yaml.

6.3. Reduce Ceph Backfill and Recovery Operations

When an OSD is removed, Ceph uses backfill and recovery operations to rebalance the cluster. This is done in order to keep multiple copies of data according to the placement group policy. These operations use system resources, so if a Ceph cluster is under load, then its performance will drop as it diverts resources to backfill and recovery. To keep the Ceph cluster performant when an OSD is removed, reduce the priority of backfill and recovery operations. The trade-off of this tuning is that there are fewer data replicas for a longer time and thus the data is at a slightly greater risk.

The three variables to modify for this setting have the following meanings as defined in the Ceph Storage Cluster OSD Configuration Reference.

  • osd recovery max active: The number of active recovery requests per OSD at one time. More requests will accelerate recovery, but the requests place an increased load on the cluster.
  • osd max backfills: The maximum number of backfills allowed to or from a single OSD.
  • osd recovery op priority: The priority set for recovery operations. It is relative to osd client op priority.

To have Red Hat OpenStack Platform director configure the Ceph cluster to favor performance during rebuild over recovery speed, configure a Heat environment file with the following values:

parameter_defaults:
  ExtraConfig:
    ceph::profile::params::osd_recovery_op_priority: 2

Red Hat Ceph Storage versions prior to 2 also require the following to be in the above file:

    ceph::profile::params::osd_recovery_max_active: 3
    ceph::profile::params::osd_max_backfills: 1

However, as these values are presently the defaults in version 2 and later, they do not need to be placed in the Heat environment file.

The above settings were made to ~/custom-templates/ceph.yaml in Section 5.4, “Ceph Configuration”. If they need to be updated, then the Heat template may be updated and the openstack overcloud deploy command, as covered in Chapter 7, Deployment, may be re-run so that Red Hat OpenStack Platform director updates the configuration on the overcloud.
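
The version-dependent choice of settings described in this section can be summarized in a short sketch. The function name and the ceph_major_version parameter are hypothetical; the keys and values are the ones listed above.

```python
def recovery_overrides(ceph_major_version):
    """Return the ExtraConfig entries for backfill and recovery tuning."""
    overrides = {"ceph::profile::params::osd_recovery_op_priority": 2}
    if ceph_major_version < 2:
        # Versions 2 and later already default to these values.
        overrides["ceph::profile::params::osd_recovery_max_active"] = 3
        overrides["ceph::profile::params::osd_max_backfills"] = 1
    return overrides


if __name__ == "__main__":
    for key, value in recovery_overrides(1).items():
        print(f"    {key}: {value}")
```

The printed lines use the same indentation as entries under ExtraConfig in the environment file.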

6.4. Regarding tuned

The default tuned profile for Red Hat Enterprise Linux 7 is throughput-performance. Though the virtual-host profile is recommended for Compute nodes, in the case of nodes which run both Ceph OSD and Nova Compute services, the throughput-performance profile is recommended in order to optimize for disk-intensive workloads. This profile should already be enabled by default and may be checked, after Chapter 7, Deployment, by using a command like the following:

[stack@hci-director ~]$ for ip in $(nova list | grep compute | awk {'print $12'} | sed s/ctlplane=//g); do ssh heat-admin@$ip "/sbin/tuned-adm active"; done
Current active profile: throughput-performance
Current active profile: throughput-performance
Current active profile: throughput-performance
Current active profile: throughput-performance
[stack@hci-director ~]$
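
When many nodes are checked, the collected output could be evaluated with a sketch like the following, which flags nodes whose active profile differs from the expected one. The function names and the node names in the example are hypothetical; only the "Current active profile:" line format is taken from the output above.

```python
PREFIX = "Current active profile:"


def active_profile(tuned_adm_output):
    """Extract the profile name from `tuned-adm active` output, or None."""
    for line in tuned_adm_output.splitlines():
        if line.startswith(PREFIX):
            return line[len(PREFIX):].strip()
    return None


def nodes_with_wrong_profile(outputs_by_node,
                             expected="throughput-performance"):
    """Return the sorted names of nodes not running the expected profile."""
    return sorted(node for node, output in outputs_by_node.items()
                  if active_profile(output) != expected)


if __name__ == "__main__":
    # Hypothetical node names with hand-written outputs.
    outputs = {
        "node-a": "Current active profile: throughput-performance\n",
        "node-b": "Current active profile: virtual-host\n",
    }
    print(nodes_with_wrong_profile(outputs))
```

An empty result means every checked node reports the expected profile.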