Chapter 8. Performance and Optimization
The fundamental business model for Communication Service Providers (CSPs) and Telcos is based on providing mission-critical applications to a large pool of subscribers with the least amount of service disruption. Earlier we highlighted two key requirements for NFV (other requirements matter as well): performance and high availability. Although high availability was covered in an earlier section, that ordering should not be construed as giving it higher priority than performance. The need for high performance stems from the fact that CSPs and Telcos must support the highest number of subscribers using the lowest amount of resources in order to maximize profits. Without throughput comparable to what is achievable with purpose-built hardware solutions, this whole business model breaks down.
The journey began with Linux bridges, virtio and OVS. As demand for higher performance grew in NFV, PCI passthrough, SR-IOV, OVS with DPDK and Vector Packet Processing (VPP) were introduced to meet it. We cover each of these, and their use in vEPC, in the following sections. The higher performance requirement also applies to GiLAN when deployed, because GiLAN sits in the data plane.
8.1. Open vSwitch
Open vSwitch (OVS) is an open source software switch designed to be used as a vSwitch within virtualized server environments. OVS supports many of the capabilities you would expect from a traditional switch, but also offers support for “SDN ready” interfaces and protocols such as OpenFlow and OVSDB. Red Hat recommends Open vSwitch for Red Hat OpenStack Platform deployments, and offers out of the box OpenStack Networking (Neutron) integration with OVS.
Standard OVS (Figure 16) is built out of three main components:
- ovs-vswitchd – a user-space daemon that implements the switch logic
- kernel module (fast path) – that processes received frames based on a lookup table
- ovsdb-server – a database server that ovs-vswitchd queries to obtain its configuration. External clients can talk to ovsdb-server using OVSDB protocol
When a frame is received, the fast path (kernel space) uses match fields from the frame header to determine the flow table entry and the set of actions to execute. If the frame does not match any entry in the lookup table it is sent to the user-space daemon (vswitchd) which requires more CPU processing. The user-space daemon then determines how to handle frames of this type and sets the right entries in the fast path lookup tables.
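The fast-path/slow-path split described above can be sketched as a flow cache with an upcall on a miss. This is a minimal illustration, not the actual OVS datapath; the match fields and the stand-in classification rule are hypothetical:

```python
# Minimal sketch of the OVS fast-path/slow-path split (illustrative only;
# field names and the classification rule are hypothetical, not real OVS code).

def slow_path_classify(key):
    """Stand-in for ovs-vswitchd: decide the actions for a new flow."""
    return "drop" if key[-1] == 22 else "forward"

class FlowCache:
    def __init__(self):
        self.table = {}    # fast-path lookup table: match fields -> actions
        self.upcalls = 0   # frames punted to the user-space daemon

    def receive(self, frame):
        key = (frame["src_mac"], frame["dst_mac"], frame["dst_port"])
        if key not in self.table:        # miss: punt to the slow path
            self.upcalls += 1
            self.table[key] = slow_path_classify(key)
        return self.table[key]           # hit: execute the cached actions

cache = FlowCache()
f = {"src_mac": "aa:aa", "dst_mac": "bb:bb", "dst_port": 80}
cache.receive(f)   # first frame of the flow: upcall to user space
cache.receive(f)   # subsequent frames: fast-path hit, no upcall
```

The key point is that only the first frame of a flow pays the expensive user-space round trip; later frames hit the kernel lookup table.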
The vEPC VNF supports several applications, as discussed earlier: data, voice and video. Each application has a different tolerance to delay (latency), jitter and packet loss. This is summarized in Table 2. 3GPP specifies the AMR-NB (narrowband) codec for Voice over LTE (VoLTE); however, many operators also offer HD voice, which uses AMR-WB, and the actual codecs deployed vary by operator. When it comes to frame loss, voice applications require far less than 1% frame loss; the closer to zero, the better.
OVS has several ports: outbound ports which are connected to the physical NICs on the host using kernel device drivers, and inbound ports which are connected to VMs. The VM guest operating system (OS) is presented with vNICs using the well-known VirtIO paravirtualized network driver.
Figure 16: Standard OVS architecture; user-space and kernel space layers
While some users may find acceptable performance numbers with the standard OVS, it was never designed with NFV in mind and does not meet some of the requirements we are starting to see from VNFs. Thus Red Hat, Intel, and others have contributed to enhancing OVS with the Data Plane Development Kit (DPDK), boosting its performance to meet NFV demands; alternatively, OVS can be bypassed entirely using PCI passthrough or SR-IOV.
With traditional cloud applications such as Wordpress, Apache web servers and databases, the applications were not network centric. One could get away with the performance delivered by virtio and then stack VMs to scale the number of users (subscribers). For NFV workloads this is no longer adequate. The initial approach to improving network performance for VMs was to use PCI passthrough.
Mobile VNFs can still use standard OVS for management networks and other traffic that does not require very high throughput.
8.2. PCI Passthrough
Through Intel’s VT-d extension (IOMMU for AMD) it is possible to present PCI devices on the host system to the virtualized guest OS. This is supported by KVM (Kernel-based Virtual Machine). Using this technique it is possible to provide a guest VM exclusive access to a NIC. For all practical purposes, the VM thinks the NIC is directly connected to it.
Figure 17: PCI Passthrough: Issue
PCI passthrough suffers from one major shortcoming: a single interface (eth0 on VNF1 in Figure 17) gains complete access to and ownership of the physical NIC. As shown in the figure, the TenGig 1 NIC on the host is assigned to VM1/vNIC1, leaving VNF2's eth0 interface with no connectivity. NEPs providing vEPC VNFs often run multiple VNFs per node, and it is highly desirable to share the Ten GigE NIC. Prior to the advent of SR-IOV, NEPs used PCI passthrough mostly for demos and POCs. By the time vEPC went to production, SR-IOV was available and was chosen over passthrough.
8.3. SR-IOV (Single Root I/O Virtualization)
SR-IOV is a standard that makes a single PCI hardware device appear as multiple virtual PCI devices. SR-IOV works by introducing the idea of physical functions (PFs) and virtual functions (VFs). Physical functions (PFs) are the full-featured PCIe functions and represent the physical hardware ports; virtual functions (VFs) are lightweight functions that can be assigned to VMs. A suitable VF driver must reside within the VNF, which sees the VF as a regular NIC that communicates directly with the hardware. The number of virtual instances that can be presented depends upon the network adapter card and the device driver. For example, a single card with two ports might have two PFs, each exposing 128 VFs. These are, of course, theoretical limits.
SR-IOV for network devices is supported starting with Red Hat OpenStack Platform 6, and is described in more detail in a two-part blog post series.
While direct hardware access can provide near line-rate performance to the VNF, this approach limits the flexibility of the deployment as it breaks the software abstraction. VNFs must be initiated on Compute nodes where SR-IOV capable cards are placed, and the features offered to the VNF depend on the capabilities of the specific NIC hardware.
Steering traffic to/from the VNF (e.g. based on MAC addresses and/or 802.1q VLAN IDs) is done by the NIC hardware and is not visible to the Compute layer. Even in the simple case where two VNFs are placed in the same Compute node and want to communicate with each other (i.e. intra-host communication), traffic that goes out of the source VNF must hit the physical network adapter on the host before it is switched back to the destination VNF. Because of this, features such as firewall filtering (OpenStack security groups) or live migration are currently not available when using SR-IOV with OpenStack. Therefore, SR-IOV is a good fit for self-contained appliances, where minimal policy is expected to be enforced by OpenStack and the virtualization layer. Figure 18 shows how SR-IOV provides a mechanism for sharing the physical NIC.
It should be noted that each VM/VNF image needs to be built with the driver that supports the specific NIC. Also, SR-IOV is only supported on certain Intel NICs. http://www.intel.com/content/www/us/en/support/network-and-i-o/ethernet-products/000005722.html contains an FAQ from Intel covering the NICs that support SR-IOV.
Figure 18: NIC sharing using SR-IOV
NEPs providing vEPC and GiLAN VNFs have been deploying SR-IOV in order to be able to get the performance required from the application and have the ability to share the physical NIC amongst VMs. However, a major drawback with SR-IOV is that the VNF images have to be packaged with the NIC drivers that support the specific PNIC. This is a departure from the goals of decoupling hardware from software and achieving abstraction. Additionally, live migration is not supported with SR-IOV as guests cannot access passed-through devices after migration.
8.4. Virtio and vHost
A stock VM running on QEMU (Quick Emulator)/KVM, in the absence of acceleration features, uses virtio for data communication via the virtio_net paravirtualized driver. The virtio driver provides a "virtio ring" that contains transmit/receive queues for the VM to use. The guest VM shares the queues with QEMU; when packets are received on these queues, QEMU forwards them to the host network. This forms a bottleneck and prevents higher throughput, especially with smaller packet sizes, which imply a higher number of packets. Vhost is a solution that allows the guest VM, running as a user-space process, to share virtual queues directly with the kernel driver running on the host OS. QEMU is still required to set up the virtual queues on the device, but the packets in the queues no longer have to be processed by QEMU. This significantly improves network I/O performance.
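The shared transmit/receive queues at the heart of virtio can be modeled as a fixed-size ring with a producer (the guest) and a consumer (the host side). This is a toy model for illustration only; real virtio rings are descriptor tables in shared memory, not Python objects:

```python
# Toy model of a shared virtqueue (illustrative; real virtio rings use
# descriptor tables in shared guest memory, not Python lists).
from collections import deque

class VirtQueue:
    """Fixed-size queue shared between a guest (producer) and vhost (consumer)."""
    def __init__(self, size):
        self.size = size
        self.ring = deque()

    def guest_tx(self, pkt):
        if len(self.ring) >= self.size:
            return False              # ring full: guest must back off
        self.ring.append(pkt)
        return True

    def host_rx(self):
        # With vhost, this side runs in the host kernel, so QEMU never
        # touches the packet payload on the hot path.
        return self.ring.popleft() if self.ring else None

q = VirtQueue(size=256)
q.guest_tx(b"packet-1")
q.guest_tx(b"packet-2")
```

The performance win of vhost is precisely about who runs `host_rx`: cutting QEMU out of the per-packet path removes a copy-and-context-switch bottleneck.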
8.5. Data Plane Development Kit (DPDK)
The Data Plane Development Kit (DPDK) consists of a set of libraries and user-space drivers for fast packet processing. It is designed to run mostly in user space, enabling applications to perform their own packet processing operations directly from/to the NIC. This delivers up to wire-speed performance for certain use cases, depending on the processing depth, by cutting out much of the kernel overhead.
The DPDK libraries only provide minimal packet operations within the application but enable receiving and sending packets with a minimum number of CPU cycles. It does not provide any networking stack and instead helps to bypass the kernel network stack in order to deliver high performance. It is also not intended to be a direct replacement for all the robust packet processing capabilities (L3 forwarding, IPsec, firewalling, etc.) already found in the kernel network stack (and in many cases these features aren’t available to DPDK applications.)
In particular, DPDK provides the most significant performance improvement where an application needs to handle many small packets (~64 bytes). Traditionally, the Linux network stack does not handle small packets very well and incurs a lot of per-packet processing overhead, restricting throughput. The reasons for this overhead are compounded, but they stem from the fact that the Linux network stack is designed to address the needs of general-purpose networking applications running on commodity hardware. It works quite well for that use case: not only does it support the various protocols across the different network layers, it can even function as a router. As such, it was not designed (or optimized) for cases where you need to process very large numbers of very small packets.
At a high-level technical standpoint, there are several reasons for the bulk including overhead of allocating/deallocating socket buffers, complexity of the socket buffer (sk_buff) data structure, multiple memory copies, processing of packets layer by layer within the network stack, and context switching between kernel level and user-space applications - all of which leads to a CPU bottleneck when you have many small packets (resulting in inefficient data processing through the kernel.)
The DPDK libraries are designed to address many of these issues and provide a lightweight framework for getting packets directly to/from applications. The DPDK is broken up into several core components:
- Memory Manager
- Buffer Manager
- Queue Manager
- Flow Classification
- Poll Mode Drivers
8.5.1. Memory Manager
- Responsible for allocating pools of objects in memory
- A pool is created in hugepage memory space and uses a ring to store free objects
- Also provides an alignment helper to ensure that objects are padded to spread them equally on all DRAM channels
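The pool-plus-free-ring idea can be sketched as follows. This is an illustrative stand-in only: real DPDK mempools carve objects out of hugepage-backed memory and align them across DRAM channels, which a Python list cannot model:

```python
# Sketch of a DPDK-style mempool: fixed-size objects pre-allocated up front,
# with a ring holding the free list (illustrative; real DPDK allocates from
# hugepage memory and channel-aligns the objects).
from collections import deque

class MemPool:
    def __init__(self, n_objs, obj_size):
        # All allocation cost is paid once, at pool creation time.
        self.free = deque(bytearray(obj_size) for _ in range(n_objs))

    def get(self):
        # O(1) ring pop on the hot path: no malloc, no kernel involvement.
        return self.free.popleft() if self.free else None

    def put(self, obj):
        self.free.append(obj)     # return the object to the free ring

pool = MemPool(n_objs=1024, obj_size=2048)
buf = pool.get()    # grab a pre-allocated buffer
pool.put(buf)       # hand it back when the packet is done
```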
8.5.2. Buffer Manager
- Significantly reduces the time the operating system spends allocating and deallocating buffers
- Pre-allocates fixed-size buffers, which are stored in memory pools
8.5.3. Queue Manager
- Implements safe, lockless (and fixed-size) queues instead of spin-locks, allowing different software components to process packets while avoiding unnecessary wait times
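The lockless trick for the single-producer/single-consumer case is that each side only ever writes its own index, so no lock is needed. A minimal sketch (DPDK's `rte_ring` also supports multi-producer/multi-consumer modes, which need atomic compare-and-swap and are not shown here):

```python
# Sketch of a fixed-size single-producer/single-consumer ring that needs no
# locks: the producer writes only `head`, the consumer writes only `tail`.

class SpscRing:
    def __init__(self, size):
        self.buf = [None] * size
        self.size = size
        self.head = 0   # written only by the producer
        self.tail = 0   # written only by the consumer

    def enqueue(self, item):
        if self.head - self.tail == self.size:
            return False                       # ring full
        self.buf[self.head % self.size] = item
        self.head += 1                         # publish after the write
        return True

    def dequeue(self):
        if self.tail == self.head:
            return None                        # ring empty
        item = self.buf[self.tail % self.size]
        self.tail += 1
        return item

ring = SpscRing(size=4)
for i in range(5):
    ring.enqueue(i)   # the fifth enqueue fails: the ring is full
```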
8.5.4. Flow Classification
- Provides an efficient mechanism which incorporates Intel Streaming SIMD Extensions (Intel SSE) to produce a hash based on tuple information, so that packets may be placed into flows quickly for processing, greatly improving throughput
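The essence of tuple-based classification is that hashing the 5-tuple sends every packet of a flow to the same bucket. The sketch below uses Python's built-in `hash()` purely for illustration; DPDK uses SSE-accelerated hash functions via libraries such as `rte_hash`:

```python
# Sketch of tuple-based flow classification: hash the 5-tuple so all packets
# of a flow land in the same bucket (illustrative; DPDK uses SIMD-accelerated
# hashes, not Python's hash()).

def flow_bucket(src_ip, dst_ip, src_port, dst_port, proto, n_buckets=256):
    """Map a packet's 5-tuple to one of n_buckets flow buckets."""
    return hash((src_ip, dst_ip, src_port, dst_port, proto)) % n_buckets

# Two packets of the same flow always classify to the same bucket:
a = flow_bucket("10.0.0.1", "10.0.0.2", 40000, 80, "tcp")
b = flow_bucket("10.0.0.1", "10.0.0.2", 40000, 80, "tcp")
```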
8.5.5. Poll Mode Drivers
- Designed to work without asynchronous, interrupt-based signaling mechanisms, which greatly speeds up the packet pipeline at the cost of allocating a CPU core to be constantly polling for new packets
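The poll-mode trade-off can be shown in a few lines: the loop spins whether or not packets are present, which burns a core but never pays interrupt or wakeup latency. This is an illustrative sketch; a real PMD polls NIC descriptor rings in hardware-sized bursts:

```python
# Sketch of a poll-mode receive loop: the core spins on the queue in bursts
# instead of waiting for interrupts (illustrative; a real PMD reads NIC
# descriptor rings, typically in bursts of up to 32 packets).

def poll_loop(rx_queue, max_polls):
    """Busy-poll the receive queue; returns packets seen in max_polls spins."""
    received = []
    for _ in range(max_polls):      # the loop runs even if the queue is empty
        burst = rx_queue[:32]       # take up to a burst of 32 packets
        del rx_queue[:32]
        received.extend(burst)
    return received

queue = [f"pkt{i}" for i in range(40)]
pkts = poll_loop(queue, max_polls=2)   # two spins drain 32 + 8 packets
```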
For vEPC and GiLAN applications that want to use the underlying DPDK-accelerated OVS for high throughput, it is important to note that to fully take advantage of OVS-DPDK, the guest VMs will also have to be DPDK enabled.
8.6. DPDK-accelerated Open vSwitch (OVS-DPDK)
Open vSwitch can be bundled with DPDK for better performance, resulting in a DPDK-accelerated OVS (OVS+DPDK). At a high level, the idea is to replace the standard OVS kernel datapath with a DPDK-based datapath, creating a user-space vSwitch on the host which uses DPDK internally for its packet forwarding. The nice thing about this architecture is that it is mostly transparent to users, as the basic OVS features and the interfaces it exposes (such as OpenFlow, OVSDB, the command line, etc.) remain mostly the same. Figure 19 shows a comparison of standard OVS versus DPDK-accelerated OVS.
The development of OVS+DPDK is now part of the OVS project, and the code is maintained under openvswitch.org. The fact that DPDK has established an upstream community of its own was key for that, so we now have the two communities – OVS and DPDK – talking to each other in the open, and the codebase for DPDK-accelerated OVS available in the open source community.
Starting with Red Hat OpenStack Platform 8, DPDK-accelerated Open vSwitch became available to customers and partners as a Technology Preview feature, based on the work done in upstream OVS 2.4. With the release of Red Hat OpenStack Platform 10, OVS-DPDK is fully supported. It includes tight integration with the Compute and Networking layers of OpenStack via enhancements made to the OVS Neutron plug-in and agent. The implementation is also expected to include support for dpdkvhostuser ports (using QEMU vhost-user), so that VMs can still use the standard VirtIO networking driver when communicating with OVS on the host.
Red Hat sees the main advantage of OVS+DPDK in the flexibility it offers. SR-IOV, as previously described, is tightly tied to the physical NIC, resulting in a lack of software abstraction on the hypervisor side. DPDK-accelerated OVS promises to fix that by offering the “best of both worlds”: performance on one hand, and flexibility and programmability through the virtualization layer on the other.
Figure 19: Standard OVS versus user-space OVS accelerated with DPDK
8.7. DPDK with Red Hat OpenStack Platform
Generally, we see two main use-cases for using DPDK with Red Hat and Red Hat OpenStack Platform.
- DPDK enabled applications, or VNFs, written on top of Red Hat Enterprise Linux as a guest operating system. Here we are talking about Network Functions that are taking advantage of DPDK as opposed to the standard kernel networking stack for enhanced performance.
- DPDK-accelerated Open vSwitch, running within Red Hat OpenStack Platform compute nodes (the hypervisors). Here it is all about boosting the performance of OVS and allowing for faster connectivity between VNFs.
We would like to highlight that if you want to run DPDK-accelerated OVS in the compute node, you do not necessarily have to run DPDK-enabled applications in the VNFs that plug into it. This can be seen as another layer of optimization, but these are two different fundamental use-cases.
Figure 20: Standard OVS with DPDK-enabled VNFs
Figure 21: DPDK-accelerated OVS with standard VNFs
Additionally, it is possible to run DPDK-enabled VNFs without using OVS or OVS+DPDK on the Compute node, and utilize SR-IOV instead. This configuration requires a VF Poll Mode Driver (PMD) within the VM itself as shown in Figure 22.
Figure 22: SR-IOV with DPDK enabled VNFs
As mentioned earlier, the vEPC and GiLAN VNFs require very high throughput as they terminate millions of subscribers. As such, it is recommended that NEPs use DPDK-accelerated OVS with DPDK-enabled VNFs, as shown in Figure 23. This option provides the best throughput while still offering decoupling from specific NIC hardware.
Figure 23: DPDK-accelerated OVS with DPDK enabled VNFs
To improve the performance of OVS with DPDK, vHost user multiqueue support was introduced.
vHost user improves overall throughput by allowing the guest's virtqueues to be shared directly with the user-space vSwitch over a UNIX socket, bypassing QEMU for datapath traffic. The standard vHost user setup looks like what is shown in Figure 24.
Figure 24: vHost single queue
This single queue on the guest VM can become a bottleneck. To overcome this and further speed up packet transfer, vHost multiqueue can be employed, as illustrated in Figure 25. More details on multiqueue vHost user are available at: https://software.intel.com/en-us/articles/configure-vhost-user-multiqueue-for-ovs-with-dpdk
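The idea behind multiqueue is to spread flows across per-queue rings so several guest vCPUs can drain them in parallel. A minimal sketch of the queue-selection step (illustrative only; in practice the hash and queue selection are performed by the NIC/vhost layer, not application code):

```python
# Sketch of vHost multiqueue flow distribution: hash each flow's tuple to pick
# a queue, so packets of one flow stay ordered on one queue while different
# flows can be processed in parallel (illustrative stand-in).

def select_queue(flow_tuple, n_queues):
    return hash(flow_tuple) % n_queues

N_QUEUES = 4
queues = [[] for _ in range(N_QUEUES)]
flows = [("10.0.0.1", 80), ("10.0.0.2", 443), ("10.0.0.3", 5060)]
for flow in flows:
    for _ in range(3):                          # three packets per flow
        queues[select_queue(flow, N_QUEUES)].append(flow)
```

Note the ordering guarantee: because queue choice depends only on the flow tuple, a flow's packets never interleave across queues.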
Figure 25: vHost user multiqueue setup
8.8. NUMA topology
NUMA, or Non-Uniform Memory Access, is a shared memory architecture that describes the placement of main memory modules with respect to processors in a multiprocessor system. As with most other processor architectural features, ignorance of NUMA can result in subpar application memory performance. (Reference: Optimizing Applications for NUMA, David Ott, https://software.intel.com/en-us/articles/optimizing-applications-for-numa)
8.8.1. CPU Socket Affinity
When running workloads on NUMA hosts it is important that the CPUs executing the processes are on the same node as the memory used. This ensures that all memory accesses are local to the NUMA node and thus not consuming the limited cross-node memory bandwidth, for example via Intel QuickPath Interconnect (QPI) links, which adds latency to memory accesses.
Red Hat OpenStack Platform 6 and later versions provide a NUMA aware scheduler that will consider the availability of NUMA resources when choosing the host to schedule on.
The OpenStack NUMA scheduler consumes the NUMA topology information defined through the OpenStack API and places resources accordingly. When defining the NUMA topology, it is important to take into account the performance drawbacks of an incorrect topology. Figure 26 shows an undesirable NUMA resource placement: a core on socket 1 receives packets from and transmits packets to interfaces on socket 0. This is the worst-case situation. (Reference: Network Function Virtualization: Virtualized BRAS with Linux* and Intel® Architecture, https://networkbuilders.intel.com/docs/Network_Builders_RA_vBRAS_Final.pdf)
Figure 26: NUMA node Sub-optimal I/O scenarios
To avoid this situation, it is recommended to select a NUMA topology of one NUMA node, so that the OpenStack Compute scheduler deploys the scenario shown in Figure 27.
Figure 27: NUMA node optimal I/O scenarios
The drawback is that the VM might not fit into one NUMA node, depending on the resources available and used on the hosts.
For virtual mobile, as for all other NFV applications, it is also important that the I/O devices (NICs) are aligned with the NUMA topology. Because NFV applications are network intensive, if the NICs are not collocated with the CPUs there is a significant performance hit (up to 30%).
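The placement rule in Figures 26 and 27 reduces to a simple invariant: a VM's vCPUs, its memory and the NIC it uses should all sit on the same node. A trivial sketch of that check (illustrative; on a real host the NIC's node is read from sysfs, e.g. /sys/class/net/<nic>/device/numa_node, and vCPU placement comes from the Nova scheduler):

```python
# Sketch of the NUMA locality invariant: cross-node I/O (Figure 26) is the
# worst case; full locality (Figure 27) is the goal (illustrative stand-in).

def placement_is_local(vcpu_node, memory_node, nic_node):
    """True when vCPUs, memory and the NIC all share one NUMA node."""
    return vcpu_node == memory_node == nic_node

good = placement_is_local(0, 0, 0)   # Figure 27: everything on node 0
bad = placement_is_local(1, 1, 0)    # Figure 26: NIC on the other socket
```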
8.9. Huge pages
From a memory management perspective, the entire physical memory is divided into "frames" and the virtual memory is divided into "pages".
The memory management unit performs a translation of virtual memory address to physical memory address. The information regarding which virtual memory page maps to which physical frame is kept in a data structure called the "Page Table". Page table lookups are costly. In order to avoid performance hits due to this lookup, a fast lookup cache called Translation Lookaside Buffer (TLB) is maintained by most architectures. This lookup cache contains the virtual memory address to physical memory address mapping. So any virtual memory address which requires translation to the physical memory address is first compared with the translation lookaside buffer for a valid mapping.
When a valid address translation is not present in the TLB, it is called a "TLB miss". If a TLB miss occurs, the memory management unit will have to refer to the page tables to get the translation. This brings additional performance costs, hence it is important that we try to reduce the TLB misses.
Most modern virtualization hosts support a variety of memory page sizes. On x86 the smallest, used by the kernel by default, is 4KB, while large sizes include 2MB and 1GB. The CPU's TLB cache has a limited size, so when a very large amount of RAM is present and utilized, cache efficiency can be fairly low, which in turn increases memory access latency. By using larger page sizes, fewer TLB entries are needed and efficiency goes up.
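The benefit is easy to quantify as TLB "reach": the number of TLB entries times the page size gives the amount of memory addressable without a page-table walk. The entry count of 1536 below is a hypothetical figure chosen for illustration; real TLB sizes vary by CPU model:

```python
# Back-of-the-envelope TLB reach at each x86 page size. The 1536-entry TLB is
# a hypothetical example; consult your CPU's documentation for real figures.

def tlb_reach(entries, page_size):
    """Memory coverable by the TLB without a miss, in bytes."""
    return entries * page_size

ENTRIES = 1536
reach_4k = tlb_reach(ENTRIES, 4 * 1024)       # 4KB pages -> 6 MB of reach
reach_2m = tlb_reach(ENTRIES, 2 * 1024**2)    # 2MB pages -> 3 GB of reach
reach_1g = tlb_reach(ENTRIES, 1024**3)        # 1GB pages -> 1.5 TB of reach
```

With 1GB huge pages the same TLB covers several orders of magnitude more RAM, which is why huge pages matter so much for memory-hungry dataplane VMs.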
The ability to control the amount of page memory in kilobytes (KB) exposed to the guest has been available since Red Hat OpenStack Platform 6.
8.10. CPU Consumption
Dedicated EPC hardware may have dedicated processors to perform network I/O; in some implementations, NEPs use specialized ASICs (Application-Specific Integrated Circuits) to speed up network I/O. For vEPC/GiLAN and NFV in general (except where OVS offload and other hardware-assist techniques are used), network I/O is performed by the CPUs of the server. Typically, when a packet arrives on the input queue the CPU gets an interrupt and one of the cores is assigned to move the packet from the input queue to the VM. Simply put, network I/O consumes CPU. For this reason, it is important to dimension the servers with adequate horsepower not only for classic processor-bound tasks but also for network I/O operations.
vEPC vendors (as is true for other VNFs as well) provide guidelines for running each VNF and application (MME, SGW/PGW, ePDG). They usually recommend the number of virtual CPUs (vCPUs) required for that application, along with the DRAM and disk.
Physical CPU model is a moving target. Newer, better and faster CPUs are released by Intel and other vendors constantly. For validating this architecture the following specifications were used for server hardware:
- 2 Intel Xeon Processor E5-2650v4 12-Core
- 128GB of DRAM
- Intel x540 10G NIC cards
- 400GB SSD drives for Ceph monitors
- 300GB SAS drives for the host OS
- 6TB SAS drives for Ceph OSD
It should be noted that even with dataplane acceleration such as OVS with DPDK, multiple CPU cores may need to be assigned, following NUMA guidelines, in order to achieve the required throughput.
8.11. CPU Pinning
CPU pinning is a technique that allows processes/threads to have an affinity configured with one or multiple cores.
By configuring a CPU affinity, the scheduler is restricted to scheduling the thread only on the nominated cores. In a NUMA configuration, if specific NUMA memory is requested for a thread, this CPU affinity setting helps ensure that the memory remains local to the thread. Figure 28 shows a logical view of how threads can be pinned to specific CPU cores and memory can be allocated local to those cores. Details can be found in A Path to Line-Rate-Capable NFV Deployments with Intel® Architecture and the OpenStack® Juno Release.
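At the OS level, pinning boils down to a scheduler affinity call. A minimal Linux-only sketch using Python's wrapper around `sched_setaffinity` (for illustration; OpenStack pins guest vCPU threads through libvirt's `vcpupin`, not by calling this API directly):

```python
# Minimal sketch of CPU pinning on Linux via the scheduler affinity API
# (illustrative; OpenStack/libvirt pin guest vCPU threads for you).
import os

def pin_to_cores(cores):
    """Restrict the calling process to the given set of CPU core IDs."""
    os.sched_setaffinity(0, cores)     # 0 = the calling process
    return os.sched_getaffinity(0)     # confirm the applied mask

allowed = pin_to_cores({0})            # core 0 exists on any host
```

Once pinned, the scheduler will never migrate the process off the nominated cores, which keeps its memory accesses local when combined with NUMA-aware memory allocation.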
Figure 28: Pinning threads to CPU cores.