Chapter 5. NFV Performance Considerations
For an NFV solution to be useful, its virtualized functions must meet or exceed the performance of physical implementations. Red Hat’s virtualization technologies are based on the high-performance Kernel-based Virtual Machine (KVM) hypervisor, common in OpenStack and cloud deployments.
5.1. CPUs and NUMA nodes
Previously, all memory on x86 systems was equally accessible to all CPUs in the system. This resulted in memory access times that were the same regardless of which CPU in the system was performing the operation and was referred to as Uniform Memory Access (UMA).
In Non-Uniform Memory Access (NUMA), system memory is divided into zones called nodes, which are allocated to particular CPUs or sockets. Access to memory that is local to a CPU is faster than access to memory connected to remote CPUs on that system. Normally, each socket on a NUMA system has a local memory node whose contents can be accessed faster than the memory in the node local to another CPU or the memory on a bus shared by all CPUs.
Similarly, physical NICs are placed in PCI slots on the Compute node hardware. These slots connect to specific CPU sockets, each of which is associated with a particular NUMA node. For optimum performance, connect your datapath NICs to the same NUMA nodes as the CPUs that you configure for the datapath (SR-IOV or OVS-DPDK).
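To check which NUMA node a NIC is attached to, you can read the values that the Linux kernel exposes in sysfs. The following is a minimal sketch, assuming a Linux host. The helper names and the sysfs_root parameter are introduced here for illustration only, and eth0 in the usage note is a hypothetical interface name.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel CPU list such as "0-3,8-11" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def nic_numa_node(nic, sysfs_root="/sys"):
    """Return the NUMA node that a NIC's PCI device reports, or None if unknown."""
    path = Path(sysfs_root) / "class" / "net" / nic / "device" / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    # The kernel reports -1 when no NUMA affinity is known for the device.
    return node if node >= 0 else None


def node_cpus(node, sysfs_root="/sys"):
    """Return the CPU IDs that belong to a NUMA node."""
    path = Path(sysfs_root) / "devices" / "system" / "node" / f"node{node}" / "cpulist"
    return parse_cpulist(path.read_text())


def cpus_local_to_nic(nic, sysfs_root="/sys"):
    """Return the CPU IDs on the same NUMA node as the NIC, or [] if unknown."""
    node = nic_numa_node(nic, sysfs_root)
    return [] if node is None else node_cpus(node, sysfs_root)
```

For example, cpus_local_to_nic("eth0") returns the host CPUs that share a NUMA node with eth0, which are the candidates for running the datapath that uses that NIC.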
The performance impact of NUMA misses is significant, generally a performance hit of 10% or more. Each CPU socket can have multiple CPU cores, which are treated as individual CPUs for virtualization purposes.
OpenStack Compute makes smart scheduling and placement decisions when launching instances. Administrators who want to take advantage of these features can create customized performance flavors to target specialized workloads including NFV and High Performance Computing (HPC).
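As an illustration of a performance flavor, the following sketch builds an openstack flavor set command that applies Compute flavor properties for CPU pinning, NUMA placement, and large memory pages. The flavor name m1.nfv and the helper function are hypothetical, and the property values are illustrative rather than recommended settings.

```python
def flavor_set_command(flavor, properties):
    """Build the argument list for `openstack flavor set` with the given properties."""
    args = ["openstack", "flavor", "set"]
    for key, value in sorted(properties.items()):
        args += ["--property", f"{key}={value}"]
    args.append(flavor)
    return args


# Illustrative values only; "m1.nfv" is a hypothetical flavor name.
nfv_properties = {
    "hw:cpu_policy": "dedicated",   # pin each vCPU to its own host CPU
    "hw:numa_nodes": 1,             # keep the guest on a single NUMA node
    "hw:mem_page_size": "large",    # back guest memory with large pages
}

print(" ".join(flavor_set_command("m1.nfv", nfv_properties)))
```

Returning a list of arguments instead of a single string means that the result can be passed directly to subprocess.run without shell quoting.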
Background information about NUMA is available in the following article: What is NUMA and how does it work on Linux?
5.2. NUMA node example
The following diagram provides an example of a two-node NUMA system and the way the CPU cores and memory pages are made available:
Remote memory, which is reached over the interconnect, is accessed only if VM1, which runs on NUMA node 0, also has a CPU core in NUMA node 1. In this case, the memory of NUMA node 1 is local to the third CPU core of VM1 (for example, if VM1 is allocated CPU 4 in the diagram above), but remote to the other CPU cores of the same VM.
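The local and remote relationship described above can be modeled in a few lines. This is a sketch only: the layout of two NUMA nodes with four host CPUs each, and the CPUs assigned to VM1, are assumptions made for illustration and might not match the diagram exactly.

```python
# Hypothetical layout: two NUMA nodes with four host CPUs each.
NODE_CPUS = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}


def node_of(cpu):
    """Return the NUMA node that owns a host CPU."""
    for node, cpus in NODE_CPUS.items():
        if cpu in cpus:
            return node
    raise ValueError(f"CPU {cpu} is not in any NUMA node")


def memory_locality(vm_cpus, memory_node):
    """Map each host CPU used by a VM to 'local' or 'remote' for memory_node."""
    return {
        cpu: "local" if node_of(cpu) == memory_node else "remote"
        for cpu in vm_cpus
    }


# VM1 has two cores on node 0 and a third core (CPU 4) on node 1.
vm1_cpus = [0, 1, 4]
for node in NODE_CPUS:
    print(f"memory on node {node}:", memory_locality(vm1_cpus, node))
```

In this model, the memory of node 1 is local only to CPU 4 and remote to CPUs 0 and 1, so a VM that spans both nodes always has some cores that pay the interconnect cost.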
5.3. CPU pinning
CPU pinning is the ability to run a specific virtual machine’s virtual CPU on a specific physical CPU on a given host. vCPU pinning provides similar advantages to task pinning on bare-metal systems. Since virtual machines run as user space tasks on the host operating system, pinning increases cache efficiency.
See Configure CPU Pinning with NUMA for further information.
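For comparison, task pinning on a bare-metal Linux system can be sketched with the scheduler affinity calls in the Python standard library. This example shows the bare-metal analogue only; it is not how OpenStack Compute pins vCPUs. The helper name is introduced here for illustration, and the calls are available on Linux only.

```python
import os


def pin_current_process(cpu):
    """Restrict the calling process to one CPU and return the resulting affinity set."""
    allowed = os.sched_getaffinity(0)
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not available to this process: {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    target = min(os.sched_getaffinity(0))  # pick any CPU that we are allowed to use
    print("pinned to:", pin_current_process(target))
```

After pinning, the scheduler no longer migrates the task to another CPU, so the data that the task loaded into that CPU's caches stays useful.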
Physical memory is segmented into contiguous regions called pages. For efficiency, the system retrieves memory by accessing entire pages instead of individual bytes of memory. Each access requires the virtual address that a process uses to be translated to a physical address. To perform this translation, the system first looks in the Translation Lookaside Buffers (TLB), which contain the virtual-to-physical address mappings for the most recently or frequently used pages. When a mapping is not in the TLB, the processor must walk the page tables to determine the address mapping, which causes a performance penalty. It is therefore preferable to optimize the TLB so that the target process avoids TLB misses as much as possible.
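The effect of TLB misses can be illustrated with a toy model. The following sketch assumes a single TLB with least-recently-used replacement, which simplifies real hardware, and the entry count of 64 in the usage note is hypothetical.

```python
from collections import OrderedDict


def count_tlb_misses(addresses, page_size, tlb_entries):
    """Count misses for a sequence of virtual addresses in a toy LRU TLB."""
    tlb = OrderedDict()  # page number -> placeholder for the cached translation
    misses = 0
    for address in addresses:
        page = address // page_size
        if page in tlb:
            tlb.move_to_end(page)        # hit: mark as most recently used
        else:
            misses += 1                  # miss: a page-table walk would happen here
            tlb[page] = True
            if len(tlb) > tlb_entries:
                tlb.popitem(last=False)  # evict the least recently used entry
    return misses


def sweep(pages, page_size, passes):
    """Touch one address in each of `pages` consecutive pages, `passes` times."""
    return [p * page_size for _ in range(passes) for p in range(pages)]


PAGE = 4096
print("32 pages, 2 passes: ", count_tlb_misses(sweep(32, PAGE, 2), PAGE, 64))
print("128 pages, 2 passes:", count_tlb_misses(sweep(128, PAGE, 2), PAGE, 64))
```

With 64 entries, the 32-page working set misses only on its first pass, whereas the 128-page working set misses on every access because each mapping is evicted before it is reused.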
The typical page size on an x86 system is 4 KB, with larger page sizes also available. Larger page sizes mean that there are fewer pages overall, which increases the amount of system memory whose virtual-to-physical address translations can be stored in the TLB. This lowers the potential for TLB misses and increases performance. With larger page sizes, there is also an increased potential for memory to be wasted, because processes must allocate memory in whole pages but are unlikely to need all of it. As a result, choosing a page size is a trade-off between faster access times with larger pages and maximum memory utilization with smaller pages.
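The trade-off can be made concrete with some arithmetic. The following sketch compares the page sizes commonly available on x86-64 systems (4 KiB, 2 MiB, and 1 GiB). The TLB entry count and the allocation size are hypothetical values chosen for illustration.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach(entries, page_size):
    """Bytes of memory whose translations fit in the TLB at once."""
    return entries * page_size


def wasted_bytes(requested, page_size):
    """Bytes allocated but unused when a request is rounded up to whole pages."""
    pages = -(-requested // page_size)  # ceiling division
    return pages * page_size - requested


TLB_ENTRIES = 1024      # hypothetical TLB size
REQUEST = 5 * MIB + 1   # hypothetical allocation of 5 MiB plus one byte

for name, size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
    print(f"{name}: reach={tlb_reach(TLB_ENTRIES, size) // MIB} MiB, "
          f"waste={wasted_bytes(REQUEST, size)} bytes")
```

Under these assumptions, moving from 4 KiB to 2 MiB pages raises the memory covered by the TLB from 4 MiB to 2048 MiB, while the memory wasted by this single allocation grows from 4095 bytes to just under 1 MiB.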