9.3. libvirt NUMA Tuning

Generally, optimal performance on NUMA systems is achieved by limiting guest size to the amount of resources on a single NUMA node. Avoid unnecessarily splitting resources across NUMA nodes.
Use the numastat tool to view per-NUMA-node memory statistics for processes and the operating system.
In the following example, the numastat tool shows four virtual machines with suboptimal memory alignment across NUMA nodes:
# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
51722 (qemu-kvm)     68     16    357   6936      2      3    147    598  8128
51747 (qemu-kvm)    245     11      5     18   5172   2532      1     92  8076
53736 (qemu-kvm)     62    432   1661    506   4851    136     22    445  8116
53773 (qemu-kvm)   1393      3      1      2     12      0      0   6702  8114
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
Total              1769    463   2024   7462  10037   2672    169   7837 32434
You can run numad to align the guests' CPUs and memory resources automatically. However, it is highly recommended to configure guest resource alignment using libvirt instead.
To verify that the memory has been aligned, run numastat -c qemu-kvm again. The following output shows successful resource alignment:
# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
51747 (qemu-kvm)      0      0      7      0   8072      0      1      0  8080
53736 (qemu-kvm)      0      0      7      0      0      0   8113      0  8120
53773 (qemu-kvm)      0      0      7      0      0      0      1   8110  8118
59065 (qemu-kvm)      0      0   8050      0      0      0      0      0  8051
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
Total                 0      0   8072      0   8072      0   8114   8110 32368
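
The comparison between the two numastat outputs can also be made programmatically. The following is a minimal Python sketch, not part of numastat or libvirt, that parses output in the layout shown above and reports how much of each process's memory sits on its most-used node. The function names parse_numastat and dominant_node_share are hypothetical, and the parser assumes that process names contain no spaces.

```python
def parse_numastat(text):
    """Parse `numastat -c` style output into {pid: [MB per node]}."""
    usage = {}
    for line in text.splitlines():
        fields = line.split()
        # Process rows start with a numeric PID, for example
        # "51747 (qemu-kvm) 0 0 7 ...". Headers, separators and the
        # final "Total" row do not, so they are skipped.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        # fields[1] is the process name; the last column is the row total.
        usage[int(fields[0])] = [float(v) for v in fields[2:-1]]
    return usage


def dominant_node_share(per_node):
    """Return (node index, fraction of the process memory on that node)."""
    total = sum(per_node)
    node = max(range(len(per_node)), key=per_node.__getitem__)
    return node, (per_node[node] / total if total else 0.0)


# Two rows taken from the outputs above: one unaligned, one aligned.
SAMPLE = """\
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
51722 (qemu-kvm)     68     16    357   6936      2      3    147    598  8128
51747 (qemu-kvm)      0      0      7      0   8072      0      1      0  8080
"""

for pid, per_node in parse_numastat(SAMPLE).items():
    node, share = dominant_node_share(per_node)
    print(f"PID {pid}: {share:.0%} of memory on node {node}")
```

A share close to 100% suggests that the guest's memory is aligned to a single node; a noticeably lower share suggests that memory is spread across nodes.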

Note

Running numastat with -c provides compact output; adding the -m option adds system-wide memory information on a per-node basis to the output. Refer to the numastat man page for more information.
For optimal performance results, memory pinning should be used in combination with pinning of vCPU threads as well as other hypervisor threads.

9.3.1. NUMA vCPU Pinning

vCPU pinning provides similar advantages to task pinning on bare metal systems. Since vCPUs run as user-space tasks on the host operating system, pinning increases cache efficiency. One example of this is an environment where all vCPU threads run on the same physical socket and therefore share an L3 cache domain.
Combining vCPU pinning with numatune can avoid NUMA misses. The performance impact of NUMA misses is significant, generally a performance hit of 10% or more. vCPU pinning and numatune should be configured together.
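
As an illustration of configuring the two together, the following Python sketch builds matching <vcpu>, <cputune>, and <numatune> fragments for a single NUMA node. It is a sketch under stated assumptions rather than a recommended tool: the helper name pinning_fragments is hypothetical, the caller must supply the host CPUs that actually belong to the chosen node, and the strict memory mode is only one of the modes that libvirt accepts.

```python
import xml.etree.ElementTree as ET


def pinning_fragments(host_cpus, node):
    """Build <vcpu>, <cputune> and <numatune> fragments for one NUMA node.

    host_cpus: host CPU IDs assumed to belong to `node` (verify this
               against the real host topology before using the result).
    node:      host NUMA node that should back the guest memory.
    """
    cpuset = ",".join(str(cpu) for cpu in host_cpus)

    # One vCPU per listed host CPU.
    vcpu = ET.Element("vcpu", {"cpuset": cpuset})
    vcpu.text = str(len(host_cpus))

    # Pin vCPU n to the n-th host CPU in the list.
    cputune = ET.Element("cputune")
    for vcpu_id, host_cpu in enumerate(host_cpus):
        ET.SubElement(cputune, "vcpupin",
                      {"vcpu": str(vcpu_id), "cpuset": str(host_cpu)})

    # Restrict guest memory allocation to the same node.
    numatune = ET.Element("numatune")
    ET.SubElement(numatune, "memory", {"mode": "strict", "nodeset": str(node)})

    return [ET.tostring(elem, encoding="unicode")
            for elem in (vcpu, cputune, numatune)]


# Example: assume host CPUs 0-3 are on NUMA node 0.
for fragment in pinning_fragments([0, 1, 2, 3], node=0):
    print(fragment)
```

The printed fragments are meant to be reviewed and merged into the guest's domain XML by hand; the sketch does not contact libvirt.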
If the virtual machine is performing storage or network I/O tasks, it can be beneficial to pin all vCPUs and memory to the same physical socket that is physically connected to the I/O adapter.
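
To decide which socket to pin to, it helps to know which NUMA node the I/O adapter is attached to. On Linux hosts, sysfs typically exposes this in a numa_node file under the device directory. The following Python sketch reads that file for a network interface; the function names are hypothetical, and the interface name eth0 in the usage line is only an example.

```python
from pathlib import Path


def parse_numa_node(text):
    """Convert the content of a sysfs numa_node file to a node number.

    Returns None when the value is not a number or is negative; the
    kernel reports -1 for devices without a NUMA affinity.
    """
    try:
        node = int(text.strip())
    except ValueError:
        return None
    return node if node >= 0 else None


def device_numa_node(iface, sysfs_root="/sys"):
    """Return the NUMA node of network interface `iface`, or None."""
    path = Path(sysfs_root) / "class" / "net" / iface / "device" / "numa_node"
    try:
        return parse_numa_node(path.read_text())
    except OSError:
        # Interface missing, or a virtual device without a backing device.
        return None


# Example usage; "eth0" is a placeholder interface name.
print(device_numa_node("eth0"))
```

If the function returns a node number, pinning the guest's vCPUs and memory to that node keeps I/O traffic local to the adapter's socket.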

Note

The lstopo tool can be used to visualize NUMA topology. It can also help verify that vCPUs are binding to cores on the same physical socket. Refer to the following Knowledgebase article for more information on lstopo: https://access.redhat.com/site/solutions/62879.

Important

Pinning causes increased complexity when there are many more vCPUs than physical cores.
The following example XML configuration has a domain process pinned to physical CPUs 0-7. Each vCPU thread is pinned to its own cpuset. For example, vCPU0 is pinned to physical CPU 0, vCPU1 is pinned to physical CPU 1, and so on:
<vcpu cpuset='0-7'>8</vcpu>
<cputune>
	<vcpupin vcpu='0' cpuset='0'/>
	<vcpupin vcpu='1' cpuset='1'/>
	<vcpupin vcpu='2' cpuset='2'/>
	<vcpupin vcpu='3' cpuset='3'/>
	<vcpupin vcpu='4' cpuset='4'/>
	<vcpupin vcpu='5' cpuset='5'/>
	<vcpupin vcpu='6' cpuset='6'/>
	<vcpupin vcpu='7' cpuset='7'/>
</cputune>
There is a direct relationship between the <vcpu> and <vcpupin> tags. If a <vcpupin> entry is not specified for a vCPU, its value is determined automatically and inherited from the cpuset of the <vcpu> tag. In the following configuration, the <vcpupin> entry for vCPU 5 is missing, so vCPU5 is pinned to physical CPUs 0-7, as specified in the <vcpu> tag:
<vcpu cpuset='0-7'>8</vcpu>
<cputune>
	<vcpupin vcpu='0' cpuset='0'/>
	<vcpupin vcpu='1' cpuset='1'/>
	<vcpupin vcpu='2' cpuset='2'/>
	<vcpupin vcpu='3' cpuset='3'/>
	<vcpupin vcpu='4' cpuset='4'/>
	<vcpupin vcpu='6' cpuset='6'/>
	<vcpupin vcpu='7' cpuset='7'/>
</cputune>
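
The inheritance rule can be checked with a short script. The following Python sketch computes the effective cpuset of every vCPU from a configuration like the one above. It only models the rule as described in this section and does not query libvirt; the function name effective_pinning is hypothetical, and the <domain> wrapper is added here solely so that the fragment parses as a single XML document.

```python
import xml.etree.ElementTree as ET

# The configuration above, with the <vcpupin> entry for vCPU 5 missing.
FRAGMENT = """
<domain>
  <vcpu cpuset='0-7'>8</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='6' cpuset='6'/>
    <vcpupin vcpu='7' cpuset='7'/>
  </cputune>
</domain>
"""


def effective_pinning(domain_xml):
    """Return {vcpu number: cpuset string} after applying inheritance."""
    root = ET.fromstring(domain_xml)
    vcpu = root.find("vcpu")
    default_cpuset = vcpu.get("cpuset")  # inherited when no <vcpupin> exists
    count = int(vcpu.text)

    # Explicit per-vCPU pins from <cputune>.
    explicit = {int(pin.get("vcpu")): pin.get("cpuset")
                for pin in root.findall("cputune/vcpupin")}

    # vCPUs without an explicit pin fall back to the <vcpu> cpuset.
    return {n: explicit.get(n, default_cpuset) for n in range(count)}


for vcpu_id, cpuset in effective_pinning(FRAGMENT).items():
    print(f"vCPU{vcpu_id} -> physical CPUs {cpuset}")
```

For the fragment shown, vCPU5 resolves to the inherited cpuset 0-7, while every other vCPU keeps its explicit one-to-one pin.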