9.3. libvirt NUMA Tuning

Generally, best performance on NUMA systems is achieved by limiting guest size to the amount of resources on a single NUMA node. Avoid unnecessarily splitting resources across NUMA nodes.
Use the numastat tool to view per-NUMA-node memory statistics for processes and the operating system.
In the following example, the numastat tool shows four virtual machines with suboptimal memory alignment across NUMA nodes:
# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
51722 (qemu-kvm)     68     16    357   6936      2      3    147    598  8128
51747 (qemu-kvm)    245     11      5     18   5172   2532      1     92  8076
53736 (qemu-kvm)     62    432   1661    506   4851    136     22    445  8116
53773 (qemu-kvm)   1393      3      1      2     12      0      0   6702  8114
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
Total              1769    463   2024   7462  10037   2672    169   7837 32434
Run numad to align the guests' CPUs and memory resources automatically.
Then run numastat -c qemu-kvm again to view the results of running numad. The following output shows that resources have been aligned:
# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
51747 (qemu-kvm)      0      0      7      0   8072      0      1      0  8080
53736 (qemu-kvm)      0      0      7      0      0      0   8113      0  8120
53773 (qemu-kvm)      0      0      7      0      0      0      1   8110  8118
59065 (qemu-kvm)      0      0   8050      0      0      0      0      0  8051
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
Total                 0      0   8072      0   8072      0   8114   8110 32368
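One way to quantify how well a guest is aligned is to compute the share of its memory that sits on its largest node. The following Python sketch is illustrative only: the helper names parse_numastat and dominant_share are hypothetical, and it assumes the compact, whole-megabyte process rows that numastat -c prints in the examples above.

```python
import re

def parse_numastat(text):
    """Return {pid: [MB on node 0, MB on node 1, ...]} from `numastat -c` rows."""
    usage = {}
    for line in text.splitlines():
        # Process rows look like: "51722 (qemu-kvm)  68  16 ...  8128"
        match = re.match(r"\s*(\d+) \(([^)]*)\)\s+(.*)", line)
        if not match:
            continue  # header, separator, and Total rows are skipped
        values = [int(v) for v in match.group(3).split()]
        usage[int(match.group(1))] = values[:-1]  # drop the Total column
    return usage

def dominant_share(per_node):
    """Fraction of a process's memory that resides on its largest node."""
    total = sum(per_node)
    return max(per_node) / total if total else 0.0

# Two rows copied from the outputs above: one before and one after numad.
SAMPLE = """\
51722 (qemu-kvm)     68     16    357   6936      2      3    147    598  8128
53773 (qemu-kvm)      0      0      7      0      0      0      1   8110  8118
"""

for pid, per_node in parse_numastat(SAMPLE).items():
    print(pid, f"{dominant_share(per_node):.1%}")
```

A share close to 100% indicates that nearly all of the guest's memory is on a single node; lower values point to memory spread across nodes.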

Note

Running numastat with -c provides compact output; adding the -m option adds system-wide memory information on a per-node basis to the output. Refer to the numastat man page for more information.

9.3.1. Monitoring Memory per host NUMA Node

You can use the nodestats.py script to report the total memory and free memory for each NUMA node on a host. This script also reports how much memory is strictly bound to certain host nodes for each running domain. For example:
# /usr/share/doc/libvirt-python-2.0.0/examples/nodestats.py
NUMA stats
NUMA nodes:     0       1       2       3
MemTotal:       3950    3967    3937    3943
MemFree:        66      56      42      41
Domain 'rhel7-0':
         Overall memory: 1536 MiB
Domain 'rhel7-1':
         Overall memory: 2048 MiB
Domain 'rhel6':
         Overall memory: 1024 MiB nodes 0-1
         Node 0: 1024 MiB nodes 0-1
Domain 'rhel7-2':
         Overall memory: 4096 MiB nodes 0-3
         Node 0: 1024 MiB nodes 0
         Node 1: 1024 MiB nodes 1
         Node 2: 1024 MiB nodes 2
         Node 3: 1024 MiB nodes 3
This example shows four host NUMA nodes, each containing approximately 4 GB of RAM in total (MemTotal). Nearly all memory on each node is consumed (MemFree). Four domains (virtual machines) are running: domain 'rhel7-0' has 1.5 GB of memory that is not pinned to any specific host NUMA node. Domain 'rhel7-2', however, has 4 GB of memory and 4 NUMA nodes that are pinned 1:1 to host nodes.
To print host NUMA node statistics, create a nodestats.py script for your environment. An example script can be found in the libvirt-python package files in /usr/share/doc/libvirt-python-version/examples/nodestats.py. The specific path to the script can be displayed by using the rpm -ql libvirt-python command.
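If only the host-side totals are needed, they can also be read from sysfs without libvirt. The following is a minimal sketch rather than a replacement for nodestats.py: it does not report per-domain bindings, the function names are hypothetical, and it assumes the per-node meminfo files under /sys/devices/system/node/ use the "Node N Field: value kB" layout.

```python
import glob
import re

def parse_node_meminfo(text):
    """Parse one node's meminfo text into {field: MiB}."""
    stats = {}
    for line in text.splitlines():
        # Lines look like: "Node 0 MemTotal:       4044800 kB"
        match = re.match(r"Node \d+ (\w+):\s+(\d+) kB", line)
        if match:
            stats[match.group(1)] = int(match.group(2)) // 1024
    return stats

def host_node_stats(pattern="/sys/devices/system/node/node*/meminfo"):
    """Return {node: (MemTotal MiB, MemFree MiB)} for every host NUMA node."""
    result = {}
    for path in sorted(glob.glob(pattern)):
        node = int(re.search(r"node(\d+)", path).group(1))
        with open(path) as handle:
            stats = parse_node_meminfo(handle.read())
        result[node] = (stats.get("MemTotal"), stats.get("MemFree"))
    return result

if __name__ == "__main__":
    for node, (total, free) in host_node_stats().items():
        print(f"node {node}: MemTotal {total} MiB, MemFree {free} MiB")
```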

9.3.2. NUMA vCPU Pinning

vCPU pinning provides similar advantages to task pinning on bare metal systems. Since vCPUs run as user-space tasks on the host operating system, pinning increases cache efficiency. One example of this is an environment where all vCPU threads are running on the same physical socket, therefore sharing an L3 cache domain.

Note

In Red Hat Enterprise Linux versions 7.0 to 7.2, it is only possible to pin active vCPUs. However, with Red Hat Enterprise Linux 7.3, pinning inactive vCPUs is available as well.
Combining vCPU pinning with numatune can avoid NUMA misses. The performance impacts of NUMA misses are significant, generally starting at a 10% performance hit or higher. vCPU pinning and numatune should be configured together.
If the virtual machine is performing storage or network I/O tasks, it can be beneficial to pin all vCPUs and memory to the same physical socket that is physically connected to the I/O adapter.

Note

The lstopo tool can be used to visualize NUMA topology. It can also help verify that vCPUs are binding to cores on the same physical socket. Refer to the following Knowledgebase article for more information on lstopo: https://access.redhat.com/site/solutions/62879.

Important

Pinning causes increased complexity where there are many more vCPUs than physical cores.
The following example XML configuration has a domain process pinned to physical CPUs 0-7. Each vCPU thread is pinned to its own cpuset. For example, vCPU0 is pinned to physical CPU 0, vCPU1 is pinned to physical CPU 1, and so on:
<vcpu cpuset='0-7'>8</vcpu>
        <cputune>
                <vcpupin vcpu='0' cpuset='0'/>
                <vcpupin vcpu='1' cpuset='1'/>
                <vcpupin vcpu='2' cpuset='2'/>
                <vcpupin vcpu='3' cpuset='3'/>
                <vcpupin vcpu='4' cpuset='4'/>
                <vcpupin vcpu='5' cpuset='5'/>
                <vcpupin vcpu='6' cpuset='6'/>
                <vcpupin vcpu='7' cpuset='7'/>
        </cputune>
There is a direct relationship between the vcpu and vcpupin tags. If a vcpupin option is not specified, the value will be automatically determined and inherited from the parent vcpu tag option. The following configuration shows <vcpupin> for vcpu 5 missing. Hence, vCPU5 would be pinned to physical CPUs 0-7, as specified in the parent tag <vcpu>:
<vcpu cpuset='0-7'>8</vcpu>
        <cputune>
                <vcpupin vcpu='0' cpuset='0'/>
                <vcpupin vcpu='1' cpuset='1'/>
                <vcpupin vcpu='2' cpuset='2'/>
                <vcpupin vcpu='3' cpuset='3'/>
                <vcpupin vcpu='4' cpuset='4'/>
                <vcpupin vcpu='6' cpuset='6'/>
                <vcpupin vcpu='7' cpuset='7'/>
        </cputune>
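The inheritance rule described above can be checked with a short script. The following Python sketch is an illustration, not libvirt's own logic: the function name effective_pinning is hypothetical, and the fragment is wrapped in a <domain> root element so that it parses as a single XML document.

```python
import xml.etree.ElementTree as ET

# The configuration above, with <vcpupin> for vCPU 5 missing.
DOMAIN = """
<domain>
  <vcpu cpuset='0-7'>8</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='6' cpuset='6'/>
    <vcpupin vcpu='7' cpuset='7'/>
  </cputune>
</domain>
"""

def effective_pinning(domain_xml):
    """Return {vcpu: cpuset}, falling back to the <vcpu> cpuset when unpinned."""
    root = ET.fromstring(domain_xml)
    vcpu = root.find("vcpu")
    default = vcpu.get("cpuset")
    count = int(vcpu.text)
    explicit = {int(p.get("vcpu")): p.get("cpuset") for p in root.iter("vcpupin")}
    return {n: explicit.get(n, default) for n in range(count)}

for number, cpuset in effective_pinning(DOMAIN).items():
    print(f"vCPU{number} -> physical CPUs {cpuset}")
```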

Important

<vcpupin>, <numatune>, and <emulatorpin> should be configured together to achieve optimal, deterministic performance. For more information on the <numatune> tag, see Section 9.3.3, “Domain Processes”. For more information on the <emulatorpin> tag, see Section 9.3.5, “Using emulatorpin”.

9.3.3. Domain Processes

As provided in Red Hat Enterprise Linux, libvirt uses libnuma to set memory binding policies for domain processes. The nodeset for these policies can be configured either as static (specified in the domain XML) or auto (configured by querying numad). Refer to the following XML configuration for examples on how to configure these inside the <numatune> tag:
<numatune>
        <memory mode='strict' placement='auto'/>
</numatune>
<numatune>
        <memory mode='strict' nodeset='0,2-3'/>
</numatune>
libvirt uses sched_setaffinity(2) to set CPU binding policies for domain processes. The cpuset option can either be static (specified in the domain XML) or auto (configured by querying numad). Refer to the following XML configuration for examples on how to configure these inside the <vcpu> tag:
<vcpu placement='auto'>8</vcpu>
<vcpu placement='static' cpuset='0-10,^5'>8</vcpu>
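In the cpuset syntax, a comma separates entries, a hyphen denotes a range, and a caret excludes a CPU from the set. The following Python sketch expands such a string into an explicit list. It is a simplified illustration with a hypothetical function name: it treats the string as order-independent and does not validate input the way libvirt does.

```python
def expand_cpuset(spec):
    """Expand a cpuset string such as '0-10,^5' into a sorted list of CPU numbers."""
    included, excluded = set(), set()
    for part in spec.split(","):
        part = part.strip()
        target = included
        if part.startswith("^"):
            # A leading caret removes the CPU from the resulting set.
            target, part = excluded, part[1:]
        if "-" in part:
            low, high = part.split("-")
            target.update(range(int(low), int(high) + 1))
        else:
            target.add(int(part))
    return sorted(included - excluded)

print(expand_cpuset("0-10,^5"))
```

For the example above, the result contains CPUs 0 through 10 without CPU 5.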
There are implicit inheritance rules between the placement mode you use for <vcpu> and <numatune>:
  • The placement mode for <numatune> defaults to the same placement mode as <vcpu>, or to static if a nodeset attribute is specified.
  • Similarly, the placement mode for <vcpu> defaults to the same placement mode as <numatune>, or to static if a cpuset attribute is specified.
This means that CPU tuning and memory tuning for domain processes can be specified and defined separately, but they can also be configured to be dependent on the other's placement mode.
It is also possible to configure your system with numad to boot a selected number of vCPUs without pinning all vCPUs at startup.
For example, to enable only 8 vCPUs at boot on a system with 32 vCPUs, configure the XML similar to the following:
<vcpu placement='auto' current='8'>32</vcpu>

Note

Refer to the following URLs for more information on vcpu and numatune: http://libvirt.org/formatdomain.html#elementsCPUAllocation and http://libvirt.org/formatdomain.html#elementsNUMATuning

9.3.4. Domain vCPU Threads

In addition to tuning domain processes, libvirt also permits the setting of the pinning policy for each vcpu thread in the XML configuration. Set the pinning policy for each vcpu thread inside the <cputune> tags:
<cputune>
        <vcpupin vcpu="0" cpuset="1-4,^2"/>
        <vcpupin vcpu="1" cpuset="0,1"/>
        <vcpupin vcpu="2" cpuset="2,3"/>
        <vcpupin vcpu="3" cpuset="0,4"/>
</cputune>
In this tag, libvirt uses either cgroup or sched_setaffinity(2) to pin the vcpu thread to the specified cpuset.

Note

For more details on <cputune>, refer to the following URL: http://libvirt.org/formatdomain.html#elementsCPUTuning
In addition, if you need to set up a virtual machine with more vCPUs than a single NUMA node provides, configure the host so that the guest detects a NUMA topology on the host. This allows for 1:1 mapping of CPUs, memory, and NUMA nodes. For example, this can be applied to a guest with 4 vCPUs and 6 GB of memory, and a host with the following NUMA settings:
4 available nodes (0-3)
Node 0: CPUs 0 4, size 4000 MiB
Node 1: CPUs 1 5, size 3999 MiB
Node 2: CPUs 2 6, size 4001 MiB
Node 3: CPUs 3 7, size 4005 MiB
In this scenario, use the following Domain XML setting:
<cputune>
	<vcpupin vcpu="0" cpuset="1"/>
	<vcpupin vcpu="1" cpuset="5"/>
	<vcpupin vcpu="2" cpuset="2"/>
	<vcpupin vcpu="3" cpuset="6"/>
</cputune>
<numatune>
  <memory mode="strict" nodeset="1-2"/>
      <memnode cellid="0" mode="strict" nodeset="1"/>
      <memnode cellid="1" mode="strict" nodeset="2"/>
</numatune>
<cpu>
	<numa>
		<cell id="0" cpus="0-1" memory="3" unit="GiB"/>
		<cell id="1" cpus="2-3" memory="3" unit="GiB"/>
	</numa>
</cpu>
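A configuration like this can be sanity-checked by confirming that every vCPU is pinned to a host CPU on the same host node that backs the memory of its guest cell. The following Python sketch does this for the example above. It is illustrative only: the function name is hypothetical, the fragments are wrapped in a <domain> root element, and the sketch assumes single-CPU vcpupin cpusets, single-node memnode nodesets, and simple ranges in the cell cpus attribute.

```python
import xml.etree.ElementTree as ET

# Host CPUs per node, for the two host nodes that the guest uses.
HOST_NODES = {1: {1, 5}, 2: {2, 6}}

GUEST = """
<domain>
  <cputune>
    <vcpupin vcpu="0" cpuset="1"/>
    <vcpupin vcpu="1" cpuset="5"/>
    <vcpupin vcpu="2" cpuset="2"/>
    <vcpupin vcpu="3" cpuset="6"/>
  </cputune>
  <numatune>
    <memory mode="strict" nodeset="1-2"/>
    <memnode cellid="0" mode="strict" nodeset="1"/>
    <memnode cellid="1" mode="strict" nodeset="2"/>
  </numatune>
  <cpu>
    <numa>
      <cell id="0" cpus="0-1" memory="3" unit="GiB"/>
      <cell id="1" cpus="2-3" memory="3" unit="GiB"/>
    </numa>
  </cpu>
</domain>
"""

def misaligned_vcpus(domain_xml, host_nodes):
    """Return vCPUs pinned to a host CPU outside the node backing their guest cell."""
    root = ET.fromstring(domain_xml)
    # Guest cell id -> host node that its memory is bound to.
    cell_to_node = {int(m.get("cellid")): int(m.get("nodeset"))
                    for m in root.iter("memnode")}
    # Guest vCPU -> guest cell id.
    vcpu_to_cell = {}
    for cell in root.iter("cell"):
        low, _, high = cell.get("cpus").partition("-")
        for vcpu in range(int(low), int(high or low) + 1):
            vcpu_to_cell[vcpu] = int(cell.get("id"))
    bad = []
    for pin in root.iter("vcpupin"):
        vcpu = int(pin.get("vcpu"))
        node = cell_to_node[vcpu_to_cell[vcpu]]
        if int(pin.get("cpuset")) not in host_nodes[node]:
            bad.append(vcpu)
    return bad

print(misaligned_vcpus(GUEST, HOST_NODES))
```

An empty list means that each guest cell's vCPUs and memory land on the same host node.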

9.3.5. Using emulatorpin

Another way of tuning the domain process pinning policy is to use the <emulatorpin> tag inside of <cputune>.
The <emulatorpin> tag specifies which host physical CPUs the emulator (a subset of a domain, not including vCPUs) will be pinned to. The <emulatorpin> tag provides a method of setting a precise affinity to emulator thread processes. As a result, vhost threads run on the same subset of physical CPUs and memory, and therefore benefit from cache locality. For example:
<cputune>
        <emulatorpin cpuset="1-3"/>
</cputune>

Note

In Red Hat Enterprise Linux 7, automatic NUMA balancing is enabled by default. Automatic NUMA balancing reduces the need for manually tuning <emulatorpin>, since the vhost-net emulator thread follows the vCPU tasks more reliably. For more information about automatic NUMA balancing, see Section 9.2, “Automatic NUMA Balancing”.

9.3.6. Tuning vCPU Pinning with virsh

Important

These are example commands only. You will need to substitute values according to your environment.
The following example virsh command pins the vCPU thread with an ID of 1 in the guest rhel7 to physical CPU 2:
% virsh vcpupin rhel7 1 2
You can also obtain the current vcpu pinning configuration with the virsh command. For example:
% virsh vcpupin rhel7

9.3.7. Tuning Domain Process CPU Pinning with virsh

Important

These are example commands only. You will need to substitute values according to your environment.
The emulatorpin option applies CPU affinity settings to threads that are associated with each domain process. For complete pinning, you must use both virsh vcpupin (as shown previously) and virsh emulatorpin for each guest. For example:
% virsh emulatorpin rhel7 3-4
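When a guest has many vCPUs, the individual commands can be generated rather than typed. The following Python sketch only builds and prints the command lines. The function name pinning_commands and the example mapping are hypothetical, and the values must be replaced to match your environment.

```python
import shlex

def pinning_commands(domain, vcpu_to_cpu, emulator_cpuset):
    """Build one `virsh vcpupin` command per vCPU plus one `virsh emulatorpin`."""
    commands = [
        ["virsh", "vcpupin", domain, str(vcpu), str(cpu)]
        for vcpu, cpu in sorted(vcpu_to_cpu.items())
    ]
    commands.append(["virsh", "emulatorpin", domain, emulator_cpuset])
    return commands

# Example values only: pin vCPU 0 and 1 to physical CPUs 0 and 1,
# and the emulator threads to physical CPUs 3-4.
for argv in pinning_commands("rhel7", {0: 0, 1: 1}, "3-4"):
    print(shlex.join(argv))
```

Printing the commands first makes it possible to review them before running them, for example by passing each list to subprocess.run.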

9.3.8. Tuning Domain Process Memory Policy with virsh

Domain process memory can be dynamically tuned. Refer to the following example command:
% virsh numatune rhel7 --nodeset 0-10
More examples of these commands can be found in the virsh man page.

9.3.9. Guest NUMA Topology

Guest NUMA topology can be specified using the <numa> tag inside the <cpu> tag in the guest virtual machine's XML. Refer to the following example, and replace values accordingly:
<cpu>
        ...
    <numa>
      <cell cpus='0-3' memory='512000'/>
      <cell cpus='4-7' memory='512000'/>
    </numa>
    ...
</cpu>
Each <cell> element specifies a NUMA cell or a NUMA node. cpus specifies the CPU or range of CPUs that are part of the node, and memory specifies the node memory in kibibytes (blocks of 1024 bytes). Each cell or node is assigned a cellid or nodeid in increasing order starting from 0.
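For guests with many cells, the <numa> block can be generated. The following Python sketch splits vCPUs and memory evenly across cells and emits the same form as the example above. The function name numa_cells is hypothetical, and an even split is an assumption made for simplicity, not a libvirt requirement.

```python
def numa_cells(vcpus, memory_kib, nodes):
    """Return a <numa> block that splits vCPUs and memory (KiB) evenly across cells."""
    if vcpus % nodes or memory_kib % nodes:
        raise ValueError("vCPUs and memory must divide evenly across the nodes")
    cpus_per_cell = vcpus // nodes
    memory_per_cell = memory_kib // nodes
    lines = ["<numa>"]
    for cell in range(nodes):
        first = cell * cpus_per_cell
        last = first + cpus_per_cell - 1
        cpus = str(first) if first == last else f"{first}-{last}"
        lines.append(f"  <cell cpus='{cpus}' memory='{memory_per_cell}'/>")
    lines.append("</numa>")
    return "\n".join(lines)

# 8 vCPUs and 1024000 KiB split over 2 cells, as in the example above.
print(numa_cells(8, 1024000, 2))
```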

Important

When modifying the NUMA topology of a guest virtual machine with a configured topology of CPU sockets, cores, and threads, make sure that cores and threads belonging to a single socket are assigned to the same NUMA node. If threads or cores from the same socket are assigned to different NUMA nodes, the guest may fail to boot.

9.3.10. Assigning Host Huge Pages to Multiple Guest NUMA Nodes

In Red Hat Enterprise Linux 7.1 and above, huge pages from the host can be allocated to multiple guest NUMA nodes. This can optimize memory performance, as guest NUMA nodes can be moved to host NUMA nodes as required, while the guest can continue to use the huge pages allocated by the host.
After configuring the guest NUMA node topology (see Section 9.3.9, “Guest NUMA Topology” for details), specify the huge page size and the guest NUMA nodeset in the <memoryBacking> element in the guest XML. The page size and unit refer to the size of the huge pages from the host. The nodeset specifies the guest NUMA node (or several nodes) to which huge pages will be assigned.
In the following example, guest NUMA nodes 0-5 (except for NUMA node 4) will use 1GB huge pages, and guest NUMA node 4 will use 2MB huge pages, regardless of guest NUMA node placement on the host. To use 1GB huge pages in guests, the host must be booted first with 1GB huge pages enabled; see Section 8.2.3, “Huge Pages and Transparent Huge Pages (THP)” for instructions on enabling 1GB huge pages.
<memoryBacking>
        <hugepages>
          <page size="1" unit="G" nodeset="0-3,5"/>
          <page size="2" unit="M" nodeset="4"/>
        </hugepages>
</memoryBacking>
This allows for greater control over huge pages in a situation where it is useful to merge some guest NUMA nodes onto a single host NUMA node, but continue to use different huge page sizes. For example, even if guest NUMA nodes 4 and 5 are moved to the host's NUMA node 1, both continue to use different sizes of huge pages.
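When planning the host's huge page pool, it can help to estimate how many pages of each size a guest NUMA node needs. The following Python sketch is a rough estimate under stated assumptions: the function name and the 4 GiB cell size are hypothetical, only the K, M, and G units are handled, and each unit is treated as a binary multiple of kibibytes.

```python
# Page size units expressed in KiB (binary multiples).
UNIT_KIB = {"K": 1, "M": 1024, "G": 1024 * 1024}

def pages_needed(cell_memory_kib, page_size, unit):
    """Number of huge pages required to back a guest NUMA node of the given size."""
    page_kib = page_size * UNIT_KIB[unit]
    return -(-cell_memory_kib // page_kib)  # ceiling division

# Hypothetical guest NUMA node with 4 GiB of memory.
cell_kib = 4 * 1024 * 1024
print("1G pages:", pages_needed(cell_kib, 1, "G"))
print("2M pages:", pages_needed(cell_kib, 2, "M"))
```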

Note

When using strict memory mode, the guest will fail to start when there are not enough huge pages available on the NUMA node. See Section 9.3.3, “Domain Processes” for a configuration example of the strict memory mode option within the <numatune> tag.

9.3.11. NUMA Node Locality for PCI Devices

When starting a new virtual machine, it is important to know both the host NUMA topology and the PCI device affiliation to NUMA nodes, so that when PCI passthrough is requested, the guest is pinned onto the correct NUMA nodes for optimal memory performance.
For example, if a guest is pinned to NUMA nodes 0-1, but one of its PCI devices is affiliated with node 2, data transfers between the device and guest memory must cross NUMA nodes, which adds latency.
In Red Hat Enterprise Linux 7.1 and above, libvirt reports the NUMA node locality for PCI devices in the guest XML, enabling management applications to make better performance decisions.
This information is visible in the sysfs files in /sys/devices/pci*/*/numa_node. One way to verify these settings is to use the lstopo tool to report sysfs data:
# lstopo-no-graphics
Machine (126GB)
  NUMANode L#0 (P#0 63GB)
    Socket L#0 + L3 L#0 (20MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#4)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#6)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#8)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#10)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#12)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#14)
    HostBridge L#0
      PCIBridge
        PCI 8086:1521
          Net L#0 "em1"
        PCI 8086:1521
          Net L#1 "em2"
        PCI 8086:1521
          Net L#2 "em3"
        PCI 8086:1521
          Net L#3 "em4"
      PCIBridge
        PCI 1000:005b
          Block L#4 "sda"
          Block L#5 "sdb"
          Block L#6 "sdc"
          Block L#7 "sdd"
      PCIBridge
        PCI 8086:154d
          Net L#8 "p3p1"
        PCI 8086:154d
          Net L#9 "p3p2"
      PCIBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 102b:0534
                GPU L#10 "card0"
                GPU L#11 "controlD64"
      PCI 8086:1d02
  NUMANode L#1 (P#1 63GB)
    Socket L#1 + L3 L#1 (20MB)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#1)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#3)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#5)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#7)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#9)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#11)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#13)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
    HostBridge L#8
      PCIBridge
        PCI 1924:0903
          Net L#12 "p1p1"
        PCI 1924:0903
          Net L#13 "p1p2"
      PCIBridge
        PCI 15b3:1003
          Net L#14 "ib0"
          Net L#15 "ib1"
          OpenFabrics L#16 "mlx4_0"


This output shows:
  • NICs em* and disks sd* are connected to NUMA node 0 and cores 0,2,4,6,8,10,12,14.
  • NICs p1* and ib* are connected to NUMA node 1 and cores 1,3,5,7,9,11,13,15.
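The same locality information can be read directly from the sysfs files mentioned above. The following Python sketch lists the NUMA node reported for each device. The function name pci_numa_nodes is hypothetical, and the sketch assumes that each numa_node file contains a single integer, with -1 meaning that no node affinity is reported.

```python
import glob
import os

def pci_numa_nodes(pattern="/sys/devices/pci*/*/numa_node"):
    """Return {PCI address: NUMA node} for every numa_node file matching the pattern."""
    nodes = {}
    for path in sorted(glob.glob(pattern)):
        # The directory that holds numa_node is named after the PCI address.
        address = os.path.basename(os.path.dirname(path))
        with open(path) as handle:
            nodes[address] = int(handle.read().strip())
    return nodes

if __name__ == "__main__":
    for address, node in pci_numa_nodes().items():
        print(f"{address}: NUMA node {node}")
```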