Virtualization Tuning and Optimization Guide
Optimizing your virtual environment
Chapter 1. Introduction
1.1. Why Performance Optimization Matters in Virtualization
1.2. KVM Performance Architecture Overview
- When using KVM, guests run as Linux processes on the host.
- Virtual CPUs (vCPUs) are implemented as normal threads, handled by the Linux scheduler.
- Guests do not automatically inherit features such as NUMA and huge pages from the kernel.
- Disk and network I/O settings in the host have a significant performance impact.
- Network traffic typically travels through a software-based bridge.
- Depending on the devices and their models, there might be significant overhead due to emulation of that particular hardware.
1.3. Virtualization Performance Features and Improvements
Virtualization Performance Improvements in Red Hat Enterprise Linux 7
- Automatic NUMA Balancing
- Automatic NUMA balancing improves the performance of applications running on NUMA hardware systems, without any manual tuning required for Red Hat Enterprise Linux 7 guests. Automatic NUMA balancing moves tasks, which can be threads or processes, closer to the memory they are accessing. This enables good performance with zero configuration. However, in some circumstances, providing more accurate guest configuration or setting up guest to host affinities for CPUs and memory may provide better results. For more information on automatic NUMA balancing, see Section 9.2, “Automatic NUMA Balancing”.
- VirtIO models
- Any virtual hardware that has the virtio model does not have the overhead of emulating the hardware with all its particularities. VirtIO devices have low overhead because they are designed specifically for use in virtualization environments. However, not all guest operating systems support such models. For more information on virtio, see (?? TBD)
- I/O threads
- QEMU is capable of offloading some I/O to threads running specifically for that purpose. Such threads may also have their own settings (such as affinities). This feature does not necessarily provide performance benefits, as the result depends highly on the workload of the guest. Proper benchmarking is recommended in order to assess the suitability of I/O threads. For more information on I/O threads, see (?? TBD)
- Multi-queue virtio-net
- A networking approach that enables packet sending/receiving processing to scale with the number of available vCPUs of the guest. For more information on multi-queue virtio-net, see Section 5.4.2, “Multi-Queue virtio-net”.
- Bridge Zero Copy Transmit
- Zero copy transmit mode reduces the host CPU overhead in transmitting large packets between a guest network and an external network by up to 15%, without affecting throughput. Bridge zero copy transmit is fully supported on Red Hat Enterprise Linux 7 virtual machines, but disabled by default. For more information on zero copy transmit, see Section 5.4.1, “Bridge Zero Copy Transmit”.
- APIC Virtualization (APICv)
- Newer Intel processors offer hardware virtualization of the Advanced Programmable Interrupt Controller (APICv). APICv improves virtualized AMD64 and Intel 64 guest performance by allowing the guest to directly access the APIC, dramatically cutting down interrupt latencies and the number of virtual machine exits caused by the APIC. This feature is used by default in newer Intel processors and improves I/O performance.
- EOI Acceleration
- End-of-interrupt acceleration for high bandwidth I/O on older chipsets without virtual APIC capabilities.
- Multi-queue virtio-scsi
- Improved storage performance and scalability provided by multi-queue support in the virtio-scsi driver. This enables each virtual CPU to have a separate queue and interrupt to use without affecting other vCPUs. For more information on multi-queue virtio-scsi, see Section 7.4.2, “Multi-Queue virtio-scsi”.
- Paravirtualized Ticketlocks
- Paravirtualized ticketlocks (pvticketlocks) improve the performance of Red Hat Enterprise Linux 7 guest virtual machines running on Red Hat Enterprise Linux 7 hosts with oversubscribed CPUs.
- Paravirtualized Page Faults
- Paravirtualized page faults are injected into a guest when it attempts to access a page swapped out by the host. This improves KVM guest performance when host memory is overcommitted and guest memory is swapped out.
- Paravirtualized Time
- clock_gettime system calls execute in user space through the vsyscall mechanism. Previously, issuing these system calls required the system to switch into kernel mode, and then back into user space. This greatly improves performance for some applications.
Virtualization Performance Features in Red Hat Enterprise Linux
- NUMA - Non-Uniform Memory Access. See Chapter 9, NUMA for details on NUMA.
- CFS - Completely Fair Scheduler. A modern class-focused scheduler.
- RCU - Read Copy Update. Better handling of shared thread data.
- Up to 160 virtual CPUs (vCPUs).
- Huge pages and other optimizations for memory-intensive environments. See Chapter 8, Memory for details.
- vhost-net - A fast, kernel-based VirtIO solution.
- SR-IOV - For near-native networking performance levels.
- Block I/O
- AIO - Support for a thread to overlap other I/O operations.
- MSI - PCI bus device interrupt generation.
- Disk I/O throttling - Controls on guest disk I/O requests to prevent over-utilizing host resources. See Section 7.4.1, “Disk I/O Throttling” for details.
Chapter 2. Performance Monitoring Tools
2.1. perf kvm
Use the perf command with the kvm option to collect and analyze guest operating system statistics from the host. The perf package provides the perf command. Install it by running the following command:
# yum install perf
To use perf kvm in the host, you must have access to the /proc/modules and /proc/kallsyms files from the guest. Refer to Procedure 2.1, “Copying /proc files from guest to host” to transfer the files into the host and run reports on the files.
Procedure 2.1. Copying /proc files from guest to host
If you copy the required files directly (for instance, with scp) you will only copy files of zero length. This procedure describes how to first save the files in the guest to a temporary location (with the cat command), and then copy them to the host for use by perf kvm.
- Log in to the guest and save files
Log in to the guest and save /proc/modules and /proc/kallsyms to a temporary location, /tmp:
# cat /proc/modules > /tmp/modules
# cat /proc/kallsyms > /tmp/kallsyms
- Copy the temporary files to the host
Once you have logged off from the guest, run the following example scp commands to copy the saved files to the host. Substitute your host name and TCP port if they are different:
# scp root@GuestMachine:/tmp/kallsyms guest-kallsyms
# scp root@GuestMachine:/tmp/modules guest-modules
You now have two files from the guest (guest-kallsyms and guest-modules) on the host, ready for use by perf kvm.
- Recording and reporting events with perf kvm
Using the files obtained in the previous steps, you can now record and report events in the guest, the host, or both. Run the following example command:
# perf kvm --host --guest --guestkallsyms=guest-kallsyms \
--guestmodules=guest-modules record -a -o perf.data
Note: If both --host and --guest are used in the command, output will be stored in perf.data.kvm. If only --host is used, the file will be named perf.data.host. Similarly, if only --guest is used, the file will be named perf.data.guest.
Pressing Ctrl-C stops recording.
- Reporting events
The following example command uses the file obtained by the recording process, and redirects the output into a new file, analyze.
# perf kvm --host --guest --guestmodules=guest-modules report -i perf.data.kvm \
--force > analyze
View the contents of the analyze file to examine the recorded events:
# cat analyze
# Events: 7K cycles
#
# Overhead Command Shared Object Symbol
# ........ ............ ................. .........................
#
95.06% vi vi [.] 0x48287
0.61% init [kernel.kallsyms] [k] intel_idle
0.36% vi libc-2.12.so [.] _wordcopy_fwd_aligned
0.32% vi libc-2.12.so [.] __strlen_sse42
0.14% swapper [kernel.kallsyms] [k] intel_idle
0.13% init [kernel.kallsyms] [k] uhci_irq
0.11% perf [kernel.kallsyms] [k] generic_exec_single
0.11% init [kernel.kallsyms] [k] tg_shares_up
0.10% qemu-kvm [kernel.kallsyms] [k] tg_shares_up
[output truncated...]
2.2. Virtual Performance Monitoring Unit (vPMU)
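To verify that the vPMU is supported on your system, check for the arch_perfmon flag on the host CPU by running:
# cat /proc/cpuinfo|grep arch_perfmon
To enable the vPMU, specify the cpu mode in the guest XML as host-passthrough:
# virsh dumpxml guest_name |grep "cpu mode"
<cpu mode='host-passthrough'>
After the vPMU is enabled, display the virtual machine's performance statistics by running the perf command from the guest virtual machine.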
2.3. Monitoring Performance in Virtual Machine Manager
2.3.1. Viewing a Performance Overview in Virtual Machine Manager
- In the Virtual Machine Manager main window, highlight the virtual machine that you want to view.
Figure 2.1. Selecting a virtual machine to display
- From the Virtual Machine Manager Edit menu, select Virtual Machine Details. When the Virtual Machine details window opens, there may be a console displayed. Should this happen, click View and then select Details. The Overview window opens first by default.
- Select Performance from the navigation pane on the left hand side. The Performance view shows a summary of guest performance, including CPU and Memory usage and Disk and Network input and output.
Figure 2.2. Displaying guest performance details
2.3.2. Performance Monitoring
Performance monitoring preferences can be modified in virt-manager's preferences window.
- From the Edit menu, select Preferences. The Preferences window appears.
- From the Polling tab specify the time in seconds or stats polling options.
Figure 2.3. Configuring performance monitoring
2.3.3. Displaying CPU Usage for Guests
- From the View menu, select Graph, then the Guest CPU Usage check box.
- The Virtual Machine Manager shows a graph of CPU usage for all virtual machines on your system.
Figure 2.4. Guest CPU usage graph
2.3.4. Displaying CPU Usage for Hosts
- From the View menu, select Graph, then the Host CPU Usage check box.
- The Virtual Machine Manager shows a graph of host CPU usage on your system.
Figure 2.5. Host CPU usage graph
2.3.5. Displaying Disk I/O
- Make sure that the Disk I/O statistics collection is enabled. To do this, from the Edit menu, select Preferences and click the Polling tab.
- Select the Disk I/O check box.
Figure 2.6. Enabling Disk I/O
- To enable the Disk I/O display, from the View menu, select Graph, then the Disk I/O check box.
- The Virtual Machine Manager shows a graph of Disk I/O for all virtual machines on your system.
Figure 2.7. Displaying Disk I/O
2.3.6. Displaying Network I/O
- Make sure that the Network I/O statistics collection is enabled. To do this, from the Edit menu, select Preferences and click the Polling tab.
- Select the Network I/O check box.
Figure 2.8. Enabling Network I/O
- To display the Network I/O statistics, from the View menu, select Graph, then the Network I/O check box.
- The Virtual Machine Manager shows a graph of Network I/O for all virtual machines on your system.
Figure 2.9. Displaying Network I/O
2.3.7. Displaying Memory Usage
- Make sure that the memory usage statistics collection is enabled. To do this, from the Edit menu, select Preferences and click the Polling tab.
- Select the Poll Memory stats check box.
Figure 2.10. Enabling memory usage
- To display the memory usage, from the View menu, select Graph, then the Memory Usage check box.
- The Virtual Machine Manager lists the percentage of memory in use (in megabytes) for all virtual machines on your system.
Figure 2.11. Displaying memory usage
Chapter 3. Optimizing Virtualization Performance with virt-manager
3.1. Operating System Details and Devices
3.1.1. Specifying Guest Virtual Machine Details
Figure 3.1. Provide the OS type and Version
3.1.2. Remove Unused Devices
Figure 3.2. Remove unused devices
3.2. CPU Performance Options
Figure 3.3. CPU Performance Options
3.2.1. Option: Available CPUs
Figure 3.4. CPU overcommit
3.2.2. Option: CPU Configuration
Figure 3.5. CPU Configuration Options
Run the virsh capabilities command on your host machine to view the virtualization capabilities of your system, including CPU types and NUMA capabilities.
3.2.3. Option: CPU Topology
Figure 3.6. CPU Topology Options
3.3. Virtual Disk Performance Options
Figure 3.7. Virtual Disk Performance Options
Chapter 4. tuned and tuned-adm
- virtual-guest
- Based on another tuned profile, virtual-guest also decreases the swappiness of virtual memory. The virtual-guest profile is automatically selected when creating a Red Hat Enterprise Linux 7 guest virtual machine. It is the recommended profile for virtual machines. This profile is available in Red Hat Enterprise Linux 6.3 and later, but must be manually selected when installing a virtual machine.
- virtual-host
- Based on another tuned profile, virtual-host also enables more aggressive writeback of dirty pages. This profile is the recommended profile for virtualization hosts, including both KVM and Red Hat Virtualization (RHV) hosts.
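On Red Hat Enterprise Linux 7, the tuned service is enabled by default.
To list all available profiles and identify the current active profile, run:
# tuned-adm list
Available profiles:
- balanced
- desktop
- latency-performance
- network-latency
- network-throughput
- powersave
- sap
- throughput-performance
- virtual-guest
- virtual-host
Current active profile: throughput-performance
To switch to a different profile, run:
# tuned-adm profile profile_name
For example, to switch to the virtual-host profile, run:
# tuned-adm profile virtual-host
To have the tuned service start at boot, run:
# systemctl enable tuned
To turn off all tuning and disable the service, run:
# tuned-adm off; systemctl disable tuned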
Chapter 5. Networking
5.1. Networking Tuning Tips
- Use multiple networks to avoid congestion on a single network. For example, have dedicated networks for management, backups, or live migration.
- Red Hat recommends not using multiple interfaces in the same network segment. However, if this is unavoidable, you can use arp_filter to prevent ARP Flux, an undesirable condition that can occur in both hosts and guests and is caused by the machine responding to ARP requests from more than one network interface:
# echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
or edit /etc/sysctl.conf to make this setting persistent.
5.2. Virtio and vhost_net
Figure 5.1. Virtio and vhost_net architectures
5.3. Device Assignment and SR-IOV
Figure 5.2. Device assignment and SR-IOV
5.4. Network Tuning Techniques
5.4.1. Bridge Zero Copy Transmit
To enable zero copy transmit mode, set the experimental_zcopytx kernel module parameter for the vhost_net module to 1.
5.4.2. Multi-Queue virtio-net
Multi-queue virtio-net provides the greatest benefit under the following conditions:
- Traffic packets are relatively large.
- The guest is active on many connections at the same time, with traffic running between guests, guest to host, or guest to an external system.
- The number of queues is equal to the number of vCPUs. This is because multi-queue support optimizes RX interrupt affinity and TX queue selection in order to make a specific queue private to a specific vCPU.
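5.4.2.1. Configuring Multi-Queue virtio-net
To use multi-queue virtio-net, add the following to the guest XML configuration, where N is the number of queues:
<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <driver name='vhost' queues='N'/>
</interface>
Then, in the guest, enable multi-queue support with the following command, where M is from 1 to N:
# ethtool -L eth0 combined M
When using multi-queue, set the max_files variable in the /etc/libvirt/qemu.conf file to 2048. The default limit of 1024 can be insufficient for multi-queue and cause guests to be unable to start when multi-queue is configured.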
5.5. Batching Network Packets
To set the maximum number of packets that can be batched, where N is that maximum, run:
# ethtool -C $tap rx-frames N
To enable tap rx batching for type='bridge' or type='network' interfaces, add a snippet similar to the following to the domain XML file.
...
<devices>
  <interface type='network'>
    <source network='default'/>
    <target dev='vnet0'/>
    <coalesce>
      <rx>
        <frames max='7'/>
      </rx>
    </coalesce>
  </interface>
</devices>
Chapter 6. I/O Scheduling
6.1. I/O Scheduling
6.2. I/O Scheduling with Red Hat Enterprise Linux as a Virtualization Host
When Red Hat Enterprise Linux is used as a virtualization host, the cfq scheduler is usually ideal. This scheduler performs well on nearly all workloads.
6.3. I/O Scheduling with Red Hat Enterprise Linux as a Virtualization Guest
- Red Hat Enterprise Linux guests often benefit greatly from the noop scheduler. The noop scheduler allows the host machine or hypervisor to optimize the input/output requests. The noop scheduler can combine small requests from the guest operating system into larger requests, before handing the I/O to the hypervisor. However, noop tries to use the fewest number of CPU cycles in the guest for I/O scheduling. The host/hypervisor has an overview of the requests of all guests and uses a separate strategy for handling I/O.
Note: For Red Hat Enterprise Linux 7.2 and newer, virtio-blk always uses noop implicitly. This is because it uses blk-mq.
- Depending on the workload I/O and how storage devices are attached, schedulers like deadline can be more advantageous. Performance testing is required to verify which scheduler is the most advantageous.
- Guests using storage accessed by iSCSI, SR-IOV, or physical device passthrough should not use the noop scheduler. These methods do not allow the host to optimize I/O requests to the underlying physical device.
deadline in the guest virtual machine.
6.4. Configuring the I/O Scheduler
6.4.1. Configuring the I/O Scheduler for Red Hat Enterprise Linux 5 and 6
In the following grub.conf stanza, the system is configured to use the noop scheduler. The example shown is for VMware ESX.
title Red Hat Enterprise Linux Server (2.6.18-8.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/vg0/lv0 elevator=noop
        initrd /initrd-2.6.18-8.el5.img
6.4.2. Configuring the I/O Scheduler for Red Hat Enterprise Linux 7
# vi /etc/grub2.cfg
linux16 /vmlinuz-kernel-version root=/dev/mapper/vg0-lv0 ro rd.lvm.lv=vg0/lv0 vconsole.keymap=us vconsole.font=latarcyrheb-sun16 rhgb quiet elevator=deadline
initrd16 /initramfs-kernel-version.img
Chapter 7. Block I/O
7.1. Block I/O Tuning
The virsh blkiotune command allows administrators to set or display a guest virtual machine's block I/O parameters manually in the <blkio> element in the guest XML configuration.
To display current <blkio> parameters for a virtual machine:
# virsh blkiotune virtual_machine
To set a virtual machine's <blkio> parameters, refer to the following command and replace values according to your environment:
# virsh blkiotune virtual_machine [--weight number] [--device-weights string] [--config] [--live] [--current]
Parameters include:
- weight
- The I/O weight, within the range 100 to 1000.
- device-weights
- A single string listing one or more device/weight pairs, in the format of /path/to/device,weight,/path/to/device,weight. Each weight must be within the range 100-1000, or the value 0 to remove that device from per-device listings. Only the devices listed in the string are modified; any existing per-device weights for other devices remain unchanged.
- config
- Add the --config option for changes to take effect at next boot.
- live
- Add the --live option to apply the changes to the running virtual machine.
Note: The --live option requires the hypervisor to support this action. Not all hypervisors allow live changes of the maximum memory limit.
- current
- Add the --current option to apply the changes to the current virtual machine.
See the virsh help blkiotune command for more information on using the virsh blkiotune command.
Table 7.1. Caching options
|Caching Option||Description|
|Cache=none||I/O from the guest is not cached on the host, but may be kept in a writeback disk cache. Use this option for guests with large I/O requirements. This option is generally the best choice, and is the only option to support migration.|
|Cache=writethrough||I/O from the guest is cached on the host but written through to the physical medium. This mode is slower and prone to scaling problems. Best used for a small number of guests with lower I/O requirements. Suggested for guests that do not support a writeback cache (such as Red Hat Enterprise Linux 5.5 and earlier), where migration is not needed.|
|Cache=writeback||I/O from the guest is cached on the host.|
|Cache=directsync||Similar to |
|Cache=unsafe||The host may cache all disk I/O, and sync requests from guest are ignored.|
|Cache=default||If no cache mode is specified, the system's default settings are chosen.|
In the guest XML, use the cache setting inside the driver tag to specify a caching option. For example, to set the cache as writeback:
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
7.3. I/O Mode
Table 7.2. IO mode options
|IO Mode Option||Description|
|IO=native||The default for Red Hat Virtualization (RHV) environments. This mode refers to kernel asynchronous I/O with direct I/O options.|
|IO=threads||The default is host user-mode based threads.|
|IO=default||The default in Red Hat Enterprise Linux 7 is threads mode.|
In the guest XML, use the io setting inside the driver tag, specifying native, threads, or default. For example, to set the I/O mode to threads:
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' io='threads'/>
7.4. Block I/O Tuning Techniques
7.4.1. Disk I/O Throttling
Use the virsh blkdeviotune command to set I/O limits for a virtual machine. Refer to the following example:
# virsh blkdeviotune virtual_machine device --parameter limit
Device specifies a unique target name (<target dev='name'/>) or source file (<source file='name'/>) for one of the disk devices attached to the virtual machine. Use the virsh domblklist command for a list of disk device names.
Optional parameters include:
- total-bytes-sec
- The total throughput limit in bytes per second.
- read-bytes-sec
- The read throughput limit in bytes per second.
- write-bytes-sec
- The write throughput limit in bytes per second.
- total-iops-sec
- The total I/O operations limit per second.
- read-iops-sec
- The read I/O operations limit per second.
- write-iops-sec
- The write I/O operations limit per second.
For example, to throttle vda on virtual_machine to 1000 I/O operations per second and 50 MB per second throughput, run this command:
# virsh blkdeviotune virtual_machine vda --total-iops-sec 1000 --total-bytes-sec 52428800
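The byte value in the command above is 50 MB expressed in bytes (50 × 1024 × 1024). The following is a minimal sketch of that conversion; the helper names mib_to_bytes and blkdeviotune_cmd are hypothetical and not part of virsh, and the second helper only prints the command line rather than running it.

```shell
#!/bin/sh
# Convert a throughput limit in MiB per second to bytes per second.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print (not run) a virsh blkdeviotune command line.
# $1 = guest name, $2 = disk target, $3 = IOPS limit, $4 = MiB/s limit.
# The guest and disk names are placeholders.
blkdeviotune_cmd() {
    guest=$1 disk=$2 iops=$3 mib=$4
    echo "virsh blkdeviotune $guest $disk --total-iops-sec $iops --total-bytes-sec $(mib_to_bytes "$mib")"
}

blkdeviotune_cmd virtual_machine vda 1000 50
```

Printing the command first makes it easy to review the computed limit before applying it to a guest.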
7.4.2. Multi-Queue virtio-scsi
7.4.2.1. Configuring Multi-Queue virtio-scsi
<controller type='scsi' index='0' model='virtio-scsi'>
  <driver queues='N' />
</controller>
Chapter 8. Memory
8.1. Memory Tuning Tips
- Do not allocate more resources to a guest than it will use.
- If possible, assign a guest to a single NUMA node, provided that resources are sufficient on that NUMA node. For more information on using NUMA, see Chapter 9, NUMA.
8.2. Memory Tuning on Virtual Machines
8.2.1. Memory Monitoring Tools
8.2.2. Memory Tuning with virsh
The <memtune> element in the guest XML configuration allows administrators to configure guest virtual machine memory settings manually. If <memtune> is omitted, default memory settings apply.
Display or set memory parameters in the <memtune> element in a virtual machine with the virsh memtune command, replacing values according to your environment:
# virsh memtune virtual_machine --parameter size
Optional parameters include:
- hard_limit
- The maximum memory the virtual machine can use, in kibibytes (blocks of 1024 bytes).
Warning: Setting this limit too low can result in the virtual machine being killed by the kernel.
- soft_limit
- The memory limit to enforce during memory contention, in kibibytes (blocks of 1024 bytes).
- swap_hard_limit
- The maximum memory plus swap the virtual machine can use, in kibibytes (blocks of 1024 bytes). The swap_hard_limit value must be more than the hard_limit value.
- min_guarantee
- The guaranteed minimum memory allocation for the virtual machine, in kibibytes (blocks of 1024 bytes).
See # virsh help memtune for more information on using the virsh memtune command.
The <memoryBacking> element may contain several elements that influence how virtual memory pages are backed by host pages.
Setting locked prevents the host from swapping out memory pages belonging to the guest. Add the following to the guest XML to lock the virtual memory pages in the host's memory:
<memoryBacking>
  <locked/>
</memoryBacking>
When setting locked, a hard_limit must be set in the <memtune> element to the maximum memory configured for the guest, plus any memory consumed by the process itself.
Setting nosharepages prevents the host from merging the same memory used among guests. To instruct the hypervisor to disable share pages for a guest, add the following to the guest's XML:
<memoryBacking>
  <nosharepages/>
</memoryBacking>
8.2.3. Huge Pages and Transparent Huge Pages (THP)
8.2.3.1. Configuring Transparent Huge Pages
To check the current status of transparent huge pages, run:
# cat /sys/kernel/mm/transparent_hugepage/enabled
This will set
To enable transparent huge pages by default, run:
# echo always > /sys/kernel/mm/transparent_hugepage/enabled
To disable transparent huge pages, run:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
8.2.3.2. Configuring Static Huge Pages
To use static huge pages for a guest, add the following to the guest XML configuration:
<memoryBacking>
  <hugepages/>
</memoryBacking>
Procedure 8.1. Setting huge pages
- View the current huge pages value:
# cat /proc/meminfo | grep Huge
AnonHugePages: 2048 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
- Huge pages are set in increments of 2MB. To set the number of huge pages to 25000, use the following command:
# echo 25000 > /proc/sys/vm/nr_hugepages
Note: To make the setting persistent, add the following lines to the /etc/sysctl.conf file on the guest machine, with X being the intended number of huge pages:
# echo 'vm.nr_hugepages = X' >> /etc/sysctl.conf
# sysctl -p
Afterwards, add transparent_hugepage=never to the kernel boot parameters by appending it to the end of the /kernel line in the /etc/grub2.cfg file on the guest.
- Mount the huge pages:
# mount -t hugetlbfs hugetlbfs /dev/hugepages
- Restart libvirtd, then restart the virtual machine with the following commands:
# systemctl restart libvirtd
# virsh start virtual_machine
- Verify the changes in /proc/meminfo:
# cat /proc/meminfo | grep Huge
AnonHugePages: 0 kB
HugePages_Total: 25000
HugePages_Free: 23425
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
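To turn the page counts from Procedure 8.1 into memory figures, the counts can be multiplied by the page size. The following is a rough sketch, assuming /proc/meminfo-style input on standard input; the function name hugepage_summary is hypothetical, and the sample values are the ones reported in the verification step.

```shell
#!/bin/sh
# Summarize huge page usage from /proc/meminfo-style text on stdin.
# Prints the memory reserved for huge pages and the part in use
# (total minus free), both in MiB.
hugepage_summary() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free = $2 }
        /^Hugepagesize:/    { size_kb = $2 }
        END {
            printf "reserved_mib=%d used_mib=%d\n",
                   total * size_kb / 1024, (total - free) * size_kb / 1024
        }'
}

# Sample values taken from the verification step of Procedure 8.1.
printf '%s\n' \
    'HugePages_Total: 25000' \
    'HugePages_Free: 23425' \
    'Hugepagesize: 2048 kB' | hugepage_summary
```

On a live system, the same function could be fed directly from the file, for example with hugepage_summary < /proc/meminfo.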
8.2.3.3. Enabling 1 GB huge pages for guests at boot or runtime
Procedure 8.2. Allocating 1 GB huge pages at boot time
- To allocate different sizes of huge pages at boot time, add the following kernel command-line parameters, specifying the number of huge pages. This example allocates four 1 GB huge pages and 1024 2 MB huge pages:
'default_hugepagesz=1G hugepagesz=1G hugepages=4 hugepagesz=2M hugepages=1024'
Change this command line to specify a different number of huge pages to be allocated at boot.
Note: The next two steps must also be completed the first time you allocate 1 GB huge pages at boot time.
- Mount the 2 MB and 1 GB huge pages on the host:
# mkdir /dev/hugepages1G
# mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G
# mkdir /dev/hugepages2M
# mount -t hugetlbfs -o pagesize=2M none /dev/hugepages2M
- Restart libvirtd to enable the use of 1 GB huge pages on guests:
# systemctl restart libvirtd
Procedure 8.3. Allocating 1 GB huge pages at runtime
- To allocate different sizes of huge pages at runtime, use the following commands, replacing values for the number of huge pages, the NUMA node to allocate them from, and the huge page size:
# echo 4 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
# echo 1024 > /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages
This example allocates four 1 GB huge pages from node1 and 1024 2 MB huge pages from node3.
These huge page settings can be changed at any time with the above commands, depending on the amount of free memory on the host system.
Note: The next two steps must also be completed the first time you allocate 1 GB huge pages at runtime.
- Mount the 2 MB and 1 GB huge pages on the host:
# mkdir /dev/hugepages1G
# mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G
# mkdir /dev/hugepages2M
# mount -t hugetlbfs -o pagesize=2M none /dev/hugepages2M
- Restart libvirtd to enable the use of 1 GB huge pages on guests:
# systemctl restart libvirtd
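The sysfs paths used in Procedure 8.3 encode the NUMA node and the page size in kilobytes (1 GB is 1048576 kB, 2 MB is 2048 kB). The following sketch builds such a path from a node number and a page size; the function name hugepage_sysfs_path is hypothetical, and the function only prints the path without writing to it.

```shell
#!/bin/sh
# Build the per-node sysfs path for a huge page pool.
# $1 = NUMA node number, $2 = page size, either "2M" or "1G".
hugepage_sysfs_path() {
    case $2 in
        2M) size_kb=2048 ;;
        1G) size_kb=1048576 ;;
        *)  echo "unsupported page size: $2" >&2; return 1 ;;
    esac
    echo "/sys/devices/system/node/node$1/hugepages/hugepages-${size_kb}kB/nr_hugepages"
}

# The two pools used in the runtime allocation example.
hugepage_sysfs_path 1 1G
hugepage_sysfs_path 3 2M
```

As root, a count could then be written with a command such as echo 4 > "$(hugepage_sysfs_path 1 1G)".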
8.3. Kernel Same-page Merging (KSM)
When a guest virtual machine is started, it only inherits the memory from the host qemu-kvm process. Once the guest is running, the contents of the guest operating system image can be shared when guests are running the same operating system or applications. KSM allows KVM to request that these identical guest memory regions be shared.
If KSM is in use on a NUMA host, set the /sys/kernel/mm/ksm/merge_across_nodes tunable to 0 to avoid merging pages across NUMA nodes. This can be done with the virsh node-memory-tune --shm-merge-across-nodes 0 command. Kernel memory accounting statistics can eventually contradict each other after large amounts of cross-node merging. As such, numad can become confused after the KSM daemon merges large amounts of memory. If your system has a large amount of free memory, you may achieve higher performance by turning off and disabling the KSM daemon. Refer to Chapter 9, NUMA for more information on NUMA.
- The ksm service starts and stops the KSM kernel thread.
- The ksmtuned service controls and tunes the ksm service, dynamically managing same-page merging. ksmtuned starts the ksm service and stops the ksm service if memory sharing is not necessary. When new guests are created or destroyed, ksmtuned must be instructed with the retune parameter to run.
8.3.1. The KSM Service
The ksm service is included in the qemu-kvm package.
- When the ksm service is not started, kernel same-page merging (KSM) shares only 2000 pages. This default value provides limited memory-saving benefits.
- When the ksm service is started, KSM will share up to half of the host system's main memory. Start the ksm service to enable KSM to share more memory.
# systemctl start ksm
Starting ksm: [ OK ]
The ksm service can be added to the default startup sequence. Make the ksm service persistent with the systemctl command.
# systemctl enable ksm
8.3.2. The KSM Tuning Service
The ksmtuned service fine-tunes the kernel same-page merging (KSM) configuration by looping and adjusting ksm. In addition, the ksmtuned service is notified by libvirt when a guest virtual machine is created or destroyed. The ksmtuned service has no options.
# systemctl start ksmtuned
Starting ksmtuned: [ OK ]
The ksmtuned service can be tuned with the retune parameter, which instructs ksmtuned to run tuning functions manually.
The /etc/ksmtuned.conf file is the configuration file for the ksmtuned service. The file output below is the default ksmtuned.conf file:
# Configuration file for ksmtuned.

# How long ksmtuned should sleep between tuning adjustments
# KSM_MONITOR_INTERVAL=60

# Millisecond sleep between ksm scans for 16Gb server.
# Smaller servers sleep more, bigger sleep less.
# KSM_SLEEP_MSEC=10

# KSM_NPAGES_BOOST - is added to the `npages` value, when `free memory` is less than `thres`.
# KSM_NPAGES_BOOST=300

# KSM_NPAGES_DECAY - is the value given is subtracted to the `npages` value, when `free memory` is greater than `thres`.
# KSM_NPAGES_DECAY=-50

# KSM_NPAGES_MIN - is the lower limit for the `npages` value.
# KSM_NPAGES_MIN=64

# KSM_NPAGES_MAX - is the upper limit for the `npages` value.
# KSM_NPAGES_MAX=1250

# KSM_THRES_COEF - is the RAM percentage to be calculated in parameter `thres`.
# KSM_THRES_COEF=20

# KSM_THRES_CONST - If this is a low memory system, and the `thres` value is less than `KSM_THRES_CONST`, then reset `thres` value to `KSM_THRES_CONST` value.
# KSM_THRES_CONST=2048

# uncomment the following to enable ksmtuned debug information
# LOGFILE=/var/log/ksmtuned
# DEBUG=1
Within the /etc/ksmtuned.conf file, npages sets how many pages ksm will scan before the ksmd daemon becomes inactive. This value will also be set in the /sys/kernel/mm/ksm/pages_to_scan file.
The KSM_THRES_CONST value represents the amount of available memory used as a threshold to activate ksm. ksmd is activated if either of the following occurs:
- The amount of free memory drops below the threshold, set in KSM_THRES_CONST.
- The amount of committed memory plus the threshold, KSM_THRES_CONST, exceeds the total amount of memory.
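The two activation conditions, together with the thres calculation described in the comments of the default configuration file, can be sketched as follows. This is an illustration of the stated rules, not the ksmtuned implementation: the function names are hypothetical, and all memory values are assumed to be in the same unit (MB here).

```shell
#!/bin/sh
# thres: KSM_THRES_COEF percent of total memory, but never below KSM_THRES_CONST.
# $1 = total memory, $2 = KSM_THRES_COEF, $3 = KSM_THRES_CONST
ksm_threshold() {
    total=$1 coef=$2 const=$3
    thres=$(( total * coef / 100 ))
    if [ "$thres" -lt "$const" ]; then
        thres=$const
    fi
    echo "$thres"
}

# Succeeds (exit status 0) when either activation condition holds.
# $1 = free memory, $2 = committed memory, $3 = total memory, $4 = thres
ksmd_should_run() {
    free=$1 committed=$2 total=$3 thres=$4
    [ "$free" -lt "$thres" ] || [ $(( committed + thres )) -gt "$total" ]
}

# Illustrative values for a 16384 MB host with the default coefficient and constant.
thres=$(ksm_threshold 16384 20 2048)
if ksmd_should_run 8000 14000 16384 "$thres"; then
    echo "ksmd would be activated (thres=$thres)"
else
    echo "ksmd would stay inactive (thres=$thres)"
fi
```

In this example neither free memory is below the threshold, but committed memory plus the threshold exceeds total memory, so the second condition triggers activation.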
8.3.3. KSM Variables and Monitoring
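KSM stores monitoring data in the /sys/kernel/mm/ksm/ directory. Files in this directory are updated by the kernel and are an accurate record of KSM usage and statistics.
The variables in the list below are also configurable variables in the /etc/ksmtuned.conf file, as noted above.
- full_scans
- Full scans run.
- merge_across_nodes
- Whether pages from different NUMA nodes can be merged.
- pages_shared
- Total pages shared.
- pages_sharing
- Pages currently shared.
- pages_to_scan
- Pages not scanned.
- pages_unshared
- Pages no longer shared.
- pages_volatile
- Number of volatile pages.
- run
- Whether the KSM process is running.
- sleep_millisecs
- Sleep milliseconds.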
Variables can be tuned manually with the virsh node-memory-tune command. For example, the following specifies the number of pages to scan before the shared memory service goes to sleep:
# virsh node-memory-tune --shm-pages-to-scan number
KSM tuning activity is stored in the /var/log/ksmtuned log file if the DEBUG=1 line is added to the /etc/ksmtuned.conf file. The log file location can be changed with the LOGFILE parameter. Changing the log file location is not advised and may require special configuration of SELinux settings.
8.3.4. Deactivating KSM
KSM can be deactivated by stopping the ksmtuned and ksm services. However, this action does not persist after restarting. To deactivate KSM, run the following in a terminal as root:
# systemctl stop ksmtuned
Stopping ksmtuned: [ OK ]
# systemctl stop ksm
Stopping ksm: [ OK ]
Stopping ksmtuned and ksm deactivates KSM, but this action does not persist after restarting. Persistently deactivate KSM with the systemctl commands:
# systemctl disable ksm
# systemctl disable ksmtuned
Pages that were merged before KSM was deactivated remain shared. To unmerge them, run:
# echo 2 >/sys/kernel/mm/ksm/run
After this is performed, the khugepaged daemon can rebuild transparent hugepages on the KVM guest physical memory. Using # echo 0 >/sys/kernel/mm/ksm/run stops KSM, but does not unshare all the previously created KSM pages (this is the same as the # systemctl stop ksmtuned command).
Chapter 9. NUMA
9.1. NUMA Memory Allocation Policies
- strict: The allocation fails if the memory cannot be allocated on the target node. Specifying a NUMA nodeset list without defining a memory mode attribute defaults to strict mode.
- interleave: Memory pages are allocated across the nodes specified by a nodeset, in a round-robin fashion.
- preferred: Memory is allocated from a single preferred memory node. If sufficient memory is not available, memory can be allocated from other nodes.
To enable the intended policy, set it as the value of the <memory mode> element of the domain XML file:
<numatune>
  <memory mode='preferred' nodeset='0'/>
</numatune>
If memory is overcommitted in strict mode and the guest does not have sufficient swap space, the kernel will kill some guest processes to retrieve additional memory. Red Hat recommends using preferred allocation and specifying a single nodeset (for example, nodeset='0') to prevent this situation.
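When domain XML is generated by a script, the memory policy element described in this section can be produced and validated in one place. The following is a minimal sketch using the Python standard library; the function name and the restriction to the three modes listed in this section are assumptions of the example, not part of libvirt.

```python
import xml.etree.ElementTree as ET

# The three allocation policies described in this section.
VALID_MODES = ("strict", "interleave", "preferred")


def build_numatune(mode, nodeset):
    """Return a <numatune> XML fragment for the given mode and nodeset."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported memory mode: {mode!r}")
    numatune = ET.Element("numatune")
    ET.SubElement(numatune, "memory", {"mode": mode, "nodeset": nodeset})
    return ET.tostring(numatune, encoding="unicode")


if __name__ == "__main__":
    # Recommended combination from the text: preferred mode, single nodeset.
    print(build_numatune("preferred", "0"))
```

Rejecting unknown modes early avoids writing a domain definition that is only found to be invalid when the guest is defined or started.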
9.2. Automatic NUMA Balancing
- Periodic NUMA unmapping of process memory
- NUMA hinting fault
- Migrate-on-Fault (MoF) - moves memory to where the program using it runs
- task_numa_placement - moves running programs closer to their memory
9.2.1. Configuring Automatic NUMA Balancing
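Automatic NUMA balancing is enabled when both of the following conditions are met:
- # numactl --hardware shows multiple nodes
- # cat /proc/sys/kernel/numa_balancing shows 1
To disable automatic NUMA balancing, use the following command:
# echo 0 > /proc/sys/kernel/numa_balancing
To enable automatic NUMA balancing, use the following command:
# echo 1 > /proc/sys/kernel/numa_balancing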
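The check described in this section can also be scripted. The sketch below is an illustration under stated assumptions: instead of calling numactl, it counts the nodeN directories under /sys/devices/system/node, and it treats only the value 1 in the numa_balancing file as enabled. The function names are hypothetical.

```python
from pathlib import Path


def count_numa_nodes(node_root="/sys/devices/system/node"):
    """Count the nodeN directories that the kernel exposes under node_root."""
    root = Path(node_root)
    if not root.is_dir():
        return 0
    return sum(
        1
        for entry in root.iterdir()
        if entry.is_dir()
        and entry.name.startswith("node")
        and entry.name[4:].isdigit()
    )


def auto_numa_balancing_active(
    node_root="/sys/devices/system/node",
    flag_path="/proc/sys/kernel/numa_balancing",
):
    """Return True if the host has several NUMA nodes and the flag reads 1."""
    try:
        flag = Path(flag_path).read_text().strip()
    except OSError:
        # A missing or unreadable flag file is treated as "not enabled".
        return False
    return count_numa_nodes(node_root) > 1 and flag == "1"
```

Both paths are parameters so that the logic can be exercised against a temporary directory on a host without NUMA hardware.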
9.3. libvirt NUMA Tuning
Use the numastat tool to view per-NUMA-node memory statistics for processes and the operating system.
In the following example, the numastat tool shows four virtual machines with suboptimal memory alignment across NUMA nodes:
# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
51722 (qemu-kvm)     68     16    357   6936      2      3    147    598  8128
51747 (qemu-kvm)    245     11      5     18   5172   2532      1     92  8076
53736 (qemu-kvm)     62    432   1661    506   4851    136     22    445  8116
53773 (qemu-kvm)   1393      3      1      2     12      0      0   6702  8114
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
Total              1769    463   2024   7462  10037   2672    169   7837 32434
Run numad to align the guests' CPUs and memory resources automatically.
Then run numastat -c qemu-kvm again to view the results of running numad. The following output shows that resources have been aligned:
# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
51747 (qemu-kvm)      0      0      7      0   8072      0      1      0  8080
53736 (qemu-kvm)      0      0      7      0      0      0   8113      0  8120
53773 (qemu-kvm)      0      0      7      0      0      0      1   8110  8118
59065 (qemu-kvm)      0      0   8050      0      0      0      0      0  8051
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
Total                 0      0   8072      0   8072      0   8114   8110 32368
Running numastat with -c provides compact output; adding the -m option adds system-wide memory information on a per-node basis to the output. Refer to the numastat man page for more information.
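To judge alignment across many guests, the compact numastat output shown in this section can be parsed and reduced to one figure per process: the share of its memory that sits on its largest node. The following is a minimal sketch, assuming the column layout shown above (PID, process name in parentheses, one column per node, then a Total column); the function names are hypothetical.

```python
import re

# Matches rows such as "51722 (qemu-kvm)  68  16 ...  8128".
ROW = re.compile(r"^\s*(\d+) \(([^)]*)\)\s+(.*)$")


def parse_numastat(text):
    """Return {pid: [MB on node 0, MB on node 1, ...]} for each process row."""
    usage = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, separator, and Total lines are ignored
        values = [int(value) for value in match.group(3).split()]
        usage[int(match.group(1))] = values[:-1]  # drop the Total column
    return usage


def dominant_node(per_node):
    """Return (node index, share of memory on that node) for one process."""
    total = sum(per_node)
    if total == 0:
        return None, 0.0
    node = max(range(len(per_node)), key=lambda index: per_node[index])
    return node, per_node[node] / total
```

A share close to 1.0 indicates that nearly all of the guest's memory is on a single node; lower values indicate memory spread across nodes.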
9.3.1. Monitoring Memory per host NUMA Node
Use the nodestats.py script to report the total memory and free memory for each NUMA node on a host. This script also reports how much memory is strictly bound to certain host nodes for each running domain. For example:
# /usr/share/doc/libvirt-python-2.0.0/examples/nodestats.py
NUMA stats
NUMA nodes:     0       1       2       3
MemTotal:       3950    3967    3937    3943
MemFree:        66      56      42      41
Domain 'rhel7-0':
        Overall memory: 1536 MiB
Domain 'rhel7-1':
        Overall memory: 2048 MiB
Domain 'rhel6':
        Overall memory: 1024 MiB nodes 0-1
        Node 0: 1024 MiB nodes 0-1
Domain 'rhel7-2':
        Overall memory: 4096 MiB nodes 0-3
        Node 0: 1024 MiB nodes 0
        Node 1: 1024 MiB nodes 1
        Node 2: 1024 MiB nodes 2
        Node 3: 1024 MiB nodes 3
This example shows four host NUMA nodes, each containing approximately 4GB of RAM in total (MemTotal). Nearly all memory is consumed on each node (MemFree). There are four domains (virtual machines) running: domain 'rhel7-0' has 1.5GB of memory, which is not pinned to any specific host NUMA node. Domain 'rhel7-2', however, has 4GB of memory and 4 NUMA nodes, which are pinned 1:1 to host nodes.
To print host NUMA node statistics, create a nodestats.py script for your environment. An example script can be found in the libvirt-python package files in /usr/share/doc/libvirt-python-version/examples/nodestats.py. The specific path to the script can be displayed by using the rpm -ql libvirt-python command.
9.3.2. NUMA vCPU Pinning
Combining vCPU pinning with numatune can avoid NUMA misses. The performance impacts of NUMA misses are significant, generally starting at a 10% performance hit or higher. vCPU pinning and numatune should be configured together.
<vcpu cpuset='0-7'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/>
  <vcpupin vcpu='5' cpuset='5'/>
  <vcpupin vcpu='6' cpuset='6'/>
  <vcpupin vcpu='7' cpuset='7'/>
</cputune>
The following configuration shows <vcpupin> for vcpu 5 missing. Hence, vCPU5 would be pinned to physical CPUs 0-7, as specified in the parent tag <vcpu>:
<vcpu cpuset='0-7'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/>
  <vcpupin vcpu='6' cpuset='6'/>
  <vcpupin vcpu='7' cpuset='7'/>
</cputune>
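The inheritance described above, where a vCPU without its own vcpupin entry falls back to the cpuset of the parent vcpu tag, can be checked with a short script. The following is a minimal sketch using the Python standard library; the function name is hypothetical, and the XML is assumed to be wrapped in a single root element such as domain.

```python
import xml.etree.ElementTree as ET


def effective_vcpu_pinning(domain_xml):
    """Return {vCPU number: cpuset string} after applying inheritance."""
    root = ET.fromstring(domain_xml)
    vcpu = root.find("vcpu")
    inherited = vcpu.get("cpuset")  # cpuset of the parent <vcpu> tag
    pinning = {number: inherited for number in range(int(vcpu.text))}
    # Explicit <vcpupin> entries override the inherited value.
    for pin in root.iterfind("cputune/vcpupin"):
        pinning[int(pin.get("vcpu"))] = pin.get("cpuset")
    return pinning


if __name__ == "__main__":
    example = """<domain>
      <vcpu cpuset='0-7'>8</vcpu>
      <cputune>
        <vcpupin vcpu='0' cpuset='0'/>
        <vcpupin vcpu='6' cpuset='6'/>
      </cputune>
    </domain>"""
    for number, cpuset in effective_vcpu_pinning(example).items():
        print(f"vCPU{number} -> {cpuset}")
```

Listing the effective pinning makes an accidentally omitted vcpupin entry easy to spot, because the affected vCPU shows the full parent range instead of a single CPU.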
9.3.3. Domain Processes
<numatune>
  <memory mode='strict' placement='auto'/>
</numatune>
<numatune>
  <memory mode='strict' nodeset='0,2-3'/>
</numatune>
<vcpu placement='static' cpuset='0-10,^5'>8</vcpu>
- The placement mode for <numatune> defaults to the same placement mode of <vcpu>, or to static if a nodeset is specified.
- Similarly, the placement mode for <vcpu> defaults to the same placement mode of <numatune>, or to static if cpuset is specified.
<vcpu placement='auto' current='8'>32</vcpu>
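The implicit placement rules for the vcpu and numatune tags in this section can be summarized in a few lines of code. This is a sketch of the rules as stated in this guide, not of libvirt's internal logic; the function name is hypothetical, and falling back to static when nothing at all is specified is an assumption of the example.

```python
def resolve_placements(vcpu_placement=None, cpuset=None,
                       numatune_placement=None, nodeset=None):
    """Return (vcpu placement, numatune placement) after applying defaults."""
    # An explicit placement wins; otherwise a cpuset or nodeset implies static.
    vcpu = vcpu_placement or ("static" if cpuset else None)
    numa = numatune_placement or ("static" if nodeset else None)
    # Whatever is still undecided inherits from the other tag.
    # Assumption: "static" is used when neither tag decides anything.
    return (vcpu or numa or "static", numa or vcpu or "static")
```

For example, resolve_placements(vcpu_placement="auto") yields auto for both tags, while supplying only a nodeset yields static for both.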
9.3.4. Domain vCPU Threads
<cputune>
  <vcpupin vcpu="0" cpuset="1-4,^2"/>
  <vcpupin vcpu="1" cpuset="0,1"/>
  <vcpupin vcpu="2" cpuset="2,3"/>
  <vcpupin vcpu="3" cpuset="0,4"/>
</cputune>
For more details on <cputune>, refer to the following URL: http://libvirt.org/formatdomain.html#elementsCPUTuning
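The cpuset strings used in these examples combine single CPUs, ranges, and exclusions marked with a caret (^). The following sketch expands such a string into the set of CPU numbers it denotes. It covers only the forms that appear in this guide, and the function name is hypothetical.

```python
def parse_cpuset(spec):
    """Expand a cpuset string such as '1-4,^2' into a set of CPU numbers."""
    included, excluded = set(), set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        target = included
        if part.startswith("^"):
            # A leading caret removes the CPU or range from the result.
            target = excluded
            part = part[1:]
        if "-" in part:
            first, last = part.split("-", 1)
            target.update(range(int(first), int(last) + 1))
        else:
            target.add(int(part))
    return included - excluded
```

With this helper, the string 1-4,^2 expands to CPUs 1, 3, and 4.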
4 available nodes (0-3)
Node 0: CPUs 0 4, size 4000 MiB
Node 1: CPUs 1 5, size 3999 MiB
Node 2: CPUs 2 6, size 4001 MiB
Node 3: CPUs 3 7, size 4005 MiB
<cputune>
  <vcpupin vcpu="0" cpuset="1"/>
  <vcpupin vcpu="1" cpuset="5"/>
  <vcpupin vcpu="2" cpuset="2"/>
  <vcpupin vcpu="3" cpuset="6"/>
</cputune>
<numatune>
  <memory mode="strict" nodeset="1-2"/>
</numatune>
<cpu>
  <numa>
    <cell id="0" cpus="0-1" memory="3" unit="GiB"/>
    <cell id="1" cpus="2-3" memory="3" unit="GiB"/>
  </numa>
</cpu>
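A configuration like the one above is consistent when every host CPU used for vCPU pinning belongs to one of the nodes named in the memory nodeset. The following sketch checks that property. The topology dictionary mirrors the host layout listed in this section, with node 3 assumed to own CPUs 3 and 7; the function names are hypothetical.

```python
# Host NUMA node -> host CPUs. Node 3 owning CPUs 3 and 7 is an assumption.
HOST_TOPOLOGY = {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}


def nodes_for_cpus(topology, cpus):
    """Return the set of host NUMA nodes that own any of the given CPUs."""
    wanted = set(cpus)
    return {node for node, node_cpus in topology.items() if wanted & set(node_cpus)}


def pinning_matches_nodeset(topology, vcpu_pins, memory_nodes):
    """Return True if all pinned CPUs sit on nodes listed in memory_nodes."""
    used_nodes = set()
    for cpus in vcpu_pins.values():
        used_nodes |= nodes_for_cpus(topology, cpus)
    return used_nodes <= set(memory_nodes)


if __name__ == "__main__":
    pins = {0: [1], 1: [5], 2: [2], 3: [6]}  # vCPU -> host CPUs, as above
    print(pinning_matches_nodeset(HOST_TOPOLOGY, pins, [1, 2]))
```

A False result means that at least one vCPU is pinned to a CPU whose local memory is outside the nodeset, which is the situation that leads to NUMA misses.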
9.3.5. Using emulatorpin
Another way of tuning the domain process pinning policies is to use the <emulatorpin> tag inside of <cputune>. The <emulatorpin> tag specifies which host physical CPUs the emulator (a subset of a domain, not including vCPUs) will be pinned to. The <emulatorpin> tag provides a method of setting a precise affinity to emulator thread processes. As a result, vhost threads run on the same subset of physical CPUs and memory, and therefore benefit from cache locality. For example:
<cputune>
  <emulatorpin cpuset="1-3"/>
</cputune>
Automatic NUMA balancing reduces the need for manually tuning <emulatorpin>, since the vhost-net emulator thread follows the vCPU tasks more reliably. For more information about automatic NUMA balancing, see Section 9.2, “Automatic NUMA Balancing”.
9.3.6. Tuning vCPU Pinning with virsh
The following example virsh command pins the vCPU thread with an ID of 1 in the rhel7 guest to physical CPU 2:
% virsh vcpupin rhel7 1 2
You can also obtain the current vCPU pinning configuration with the virsh command. For example:
% virsh vcpupin rhel7
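The same pinning can be applied from a script through the libvirt Python bindings. The following is a sketch under stated assumptions rather than a tested recipe: it assumes the libvirt-python package is installed, that the qemu:///system URI is appropriate for the host, and that the CPU count reported by getInfo() covers every CPU to be pinned. The helper names make_cpumap and pin_vcpu are hypothetical.

```python
def make_cpumap(pinned_cpus, host_cpu_count):
    """Build the tuple of booleans that pinVcpu() expects, one per host CPU."""
    pinned = set(pinned_cpus)
    out_of_range = sorted(cpu for cpu in pinned if not 0 <= cpu < host_cpu_count)
    if out_of_range:
        raise ValueError(f"CPUs outside 0-{host_cpu_count - 1}: {out_of_range}")
    return tuple(cpu in pinned for cpu in range(host_cpu_count))


def pin_vcpu(domain_name, vcpu, pinned_cpus, uri="qemu:///system"):
    """Pin one vCPU of a running domain to the given host CPUs."""
    import libvirt  # imported here so make_cpumap works without the bindings

    conn = libvirt.open(uri)
    try:
        domain = conn.lookupByName(domain_name)
        host_cpu_count = conn.getInfo()[2]  # number of CPUs reported by the host
        domain.pinVcpu(vcpu, make_cpumap(pinned_cpus, host_cpu_count))
    finally:
        conn.close()
```

Under these assumptions, pin_vcpu("rhel7", 1, [2]) corresponds to the virsh vcpupin rhel7 1 2 command shown in this section.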
9.3.7. Tuning Domain Process CPU Pinning with virsh
The emulatorpin option applies CPU affinity settings to threads that are associated with each domain process. For complete pinning, you must use both virsh vcpupin (as shown previously) and virsh emulatorpin for each guest. For example:
% virsh emulatorpin rhel7 3-4
9.3.8. Tuning Domain Process Memory Policy with virsh
% virsh numatune rhel7 --nodeset 0-10
9.3.9. Guest NUMA Topology
Guest NUMA topology can be specified using the <numa> tag inside the <cpu> tag in the guest virtual machine's XML. Refer to the following example, and replace values accordingly:
<cpu>
  ...
  <numa>
    <cell cpus='0-3' memory='512000'/>
    <cell cpus='4-7' memory='512000'/>
  </numa>
  ...
</cpu>
Each <cell> element specifies a NUMA cell or a NUMA node. cpus specifies the CPU or range of CPUs that are part of the node, and memory specifies the node memory in kibibytes (blocks of 1024 bytes). Each cell or node is assigned a nodeid in increasing order starting from 0.
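For guests with many vCPUs, the cell list can be generated rather than written by hand. The following sketch splits vCPUs and memory evenly across a number of cells and emits the corresponding numa element. The even split and the function name are assumptions made for the illustration; real guests may need uneven cells.

```python
import xml.etree.ElementTree as ET


def build_guest_numa(vcpus, memory_kib, cells):
    """Return a <numa> element with vCPUs and memory split evenly over cells."""
    if vcpus % cells or memory_kib % cells:
        raise ValueError("vcpus and memory_kib must divide evenly across cells")
    cpus_per_cell = vcpus // cells
    memory_per_cell = memory_kib // cells
    numa = ET.Element("numa")
    for index in range(cells):
        first = index * cpus_per_cell
        last = first + cpus_per_cell - 1
        cpus = str(first) if first == last else f"{first}-{last}"
        # memory is given in kibibytes, as described above
        ET.SubElement(numa, "cell", {"cpus": cpus, "memory": str(memory_per_cell)})
    return numa


if __name__ == "__main__":
    print(ET.tostring(build_guest_numa(8, 1024000, 2), encoding="unicode"))
```

Called with 8 vCPUs, 1024000 KiB, and 2 cells, the function produces the same two cells as the example in this section.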
9.3.10. NUMA Node Locality for PCI Devices
The NUMA node locality of PCI devices is visible in the sysfs files in /sys/devices/pci*/*/numa_node. One way to verify these settings is to use the lstopo tool:
# lstopo-no-graphics
Machine (126GB)
  NUMANode L#0 (P#0 63GB)
    Socket L#0 + L3 L#0 (20MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#4)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#6)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#8)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#10)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#12)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#14)
    HostBridge L#0
      PCIBridge
        PCI 8086:1521
          Net L#0 "em1"
        PCI 8086:1521
          Net L#1 "em2"
        PCI 8086:1521
          Net L#2 "em3"
        PCI 8086:1521
          Net L#3 "em4"
      PCIBridge
        PCI 1000:005b
          Block L#4 "sda"
          Block L#5 "sdb"
          Block L#6 "sdc"
          Block L#7 "sdd"
      PCIBridge
        PCI 8086:154d
          Net L#8 "p3p1"
        PCI 8086:154d
          Net L#9 "p3p2"
      PCIBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 102b:0534
                GPU L#10 "card0"
                GPU L#11 "controlD64"
      PCI 8086:1d02
  NUMANode L#1 (P#1 63GB)
    Socket L#1 + L3 L#1 (20MB)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#1)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#3)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#5)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#7)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#9)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#11)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#13)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
    HostBridge L#8
      PCIBridge
        PCI 1924:0903
          Net L#12 "p1p1"
        PCI 1924:0903
          Net L#13 "p1p2"
      PCIBridge
        PCI 15b3:1003
          Net L#14 "ib0"
          Net L#15 "ib1"
          OpenFabrics L#16 "mlx4_0"
This output shows:
- NICs em* and disks sd* are connected to NUMA node 0 and cores 0,2,4,6,8,10,12,14.
- NICs p1* and ib* are connected to NUMA node 1 and cores 1,3,5,7,9,11,13,15.
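The numa_node files mentioned in this section can also be read directly, which is useful when lstopo is not installed. The following is a minimal sketch: it uses the same glob pattern as the path given above, takes the sysfs root as a parameter, and uses a hypothetical function name. The kernel writes -1 when it has no node information for a device.

```python
import glob
import os


def pci_numa_nodes(sysfs_root="/sys"):
    """Return {PCI device address: NUMA node} from the numa_node sysfs files."""
    pattern = os.path.join(sysfs_root, "devices", "pci*", "*", "numa_node")
    nodes = {}
    for path in sorted(glob.glob(pattern)):
        device = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as handle:
                nodes[device] = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed entry: skip it
    return nodes


if __name__ == "__main__":
    for device, node in pci_numa_nodes().items():
        print(f"{device}: node {node}")
```

The result can be compared with a guest's vCPU and memory pinning before assigning a device to that guest.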
9.4. NUMA-Aware Kernel SamePage Merging (KSM)
Use the sysfs /sys/kernel/mm/ksm/merge_across_nodes parameter to control merging of pages across different NUMA nodes. By default, pages from all nodes can be merged together. When this parameter is set to zero, only pages from the same node are merged.
<memoryBacking>
  <nosharepages/>
</memoryBacking>
For more information about tuning memory settings with the <memoryBacking> element, see Section 8.2.2, “Memory Tuning with virsh”.
Appendix A. Revision History
|Revision 1.0-30||Fri Jan 05 2018|
|Revision 1.0-27||Mon Jul 27 2017|
|Revision 1.0-24||Mon Oct 17 2016|
|Revision 1.0-22||Mon Dec 21 2015|
|Revision 1.0-19||Thu Oct 08 2015|