Red Hat Training

A Red Hat training course is available for Red Hat Enterprise Linux

9.3. Configuration Tools

Red Hat Enterprise Linux provides a number of tools to assist administrators in configuring the system. This section outlines the available tools and provides examples of how they can be used to solve network related performance problems in Red Hat Enterprise Linux 7.
However, it is important to keep in mind that network performance problems are sometimes the result of hardware malfunction or faulty infrastructure. Red Hat highly recommends verifying that your hardware and infrastructure are working as expected before using these tools to tune the network stack.
Further, some network performance problems are better resolved by altering the application than by reconfiguring your network subsystem. It is generally a good idea to configure your application to perform frequent posix calls, even if this means queuing data in the application space, as this allows data to be stored flexibly and swapped in or out of memory as required.

9.3.1. Tuned Profiles for Network Performance

The Tuned service provides a number of different profiles to improve performance in a number of specific use cases. The following profiles can be useful for improving networking performance.
  • latency-performance
  • network-latency
  • network-throughput
For more information about these profiles, see Section A.5, “tuned-adm”.

9.3.2. Configuring the Hardware Buffer

If a large number of packets are being dropped by the hardware buffer, there are a number of potential solutions.
Slow the input traffic
Filter incoming traffic, reduce the number of joined multicast groups, or reduce the amount of broadcast traffic to decrease the rate at which the queue fills. For details of how to filter incoming traffic, see the Red Hat Enterprise Linux 7 Security Guide. For details about multicast groups, see the Red Hat Enterprise Linux 7 Clustering documentation. For details about broadcast traffic, see the Red Hat Enterprise Linux 7 System Administrator's Guide, or documentation related to the device you want to configure.
Resize the hardware buffer queue
Reduce the number of packets being dropped by increasing the size of the queue so that the it does not overflow as easily. You can modify the rx/tx parameters of the network device with the ethtool command:
# ethtool --set-ring devname value
Change the drain rate of the queue
Device weight refers to the number of packets a device can receive at one time (in a single scheduled processor access). You can increase the rate at which a queue is drained by increasing its device weight, which is controlled by the dev_weight parameter. This parameter can be temporarily altered by changing the contents of the /proc/sys/net/core/dev_weight file, or permanently altered with sysctl, which is provided by the procps-ng package.
Altering the drain rate of a queue is usually the simplest way to mitigate poor network performance. However, increasing the number of packets that a device can receive at one time uses additional processor time, during which no other processes can be scheduled, so this can cause other performance problems.

9.3.3. Configuring Interrupt Queues

If analysis reveals high latency, your system may benefit from poll-based rather than interrupt-based packet receipt.

9.3.3.1. Configuring Busy Polling

Busy polling helps reduce latency in the network receive path by allowing socket layer code to poll the receive queue of a network device, and disabling network interrupts. This removes delays caused by the interrupt and the resultant context switch. However, it also increases CPU utilization. Busy polling also prevents the CPU from sleeping, which can incur additional power consumption.
Busy polling is disabled by default. To enable busy polling on specific sockets, do the following.
  • Set sysctl.net.core.busy_poll to a value other than 0. This parameter controls the number of microseconds to wait for packets on the device queue for socket poll and selects. Red Hat recommends a value of 50.
  • Add the SO_BUSY_POLL socket option to the socket.
To enable busy polling globally, you must also set sysctl.net.core.busy_read to a value other than 0. This parameter controls the number of microseconds to wait for packets on the device queue for socket reads. It also sets the default value of the SO_BUSY_POLL option. Red Hat recommends a value of 50 for a small number of sockets, and a value of 100 for large numbers of sockets. For extremely large numbers of sockets (more than several hundred), use epoll instead.
Busy polling behavior is supported by the following drivers. These drivers are also supported on Red Hat Enterprise Linux 7.
  • bnx2x
  • be2net
  • ixgbe
  • mlx4
  • myri10ge
As of Red Hat Enterprise Linux 7.1, you can also run the following command to check whether a specific device supports busy polling.
# ethtool -k device | grep "busy-poll"
If this returns busy-poll: on [fixed], busy polling is available on the device.

9.3.4. Configuring Socket Receive Queues

If analysis suggests that packets are being dropped because the drain rate of a socket queue is too slow, there are several ways to alleviate the performance issues that result.
Decrease the speed of incoming traffic
Decrease the rate at which the queue fills by filtering or dropping packets before they reach the queue, or by lowering the weight of the device.
Increase the depth of the application's socket queue
If a socket queue that receives a limited amount of traffic in bursts, increasing the depth of the socket queue to match the size of the bursts of traffic may prevent packets from being dropped.

9.3.4.1. Decrease the Speed of Incoming Traffic

Filter incoming traffic or lower the network interface card's device weight to slow incoming traffic. For details of how to filter incoming traffic, see the Red Hat Enterprise Linux 7 Security Guide.
Device weight refers to the number of packets a device can receive at one time (in a single scheduled processor access). Device weight is controlled by the dev_weight parameter. This parameter can be temporarily altered by changing the contents of the /proc/sys/net/core/dev_weight file, or permanently altered with sysctl, which is provided by the procps-ng package.

9.3.4.2. Increasing Queue Depth

Increasing the depth of an application socket queue is typically the easiest way to improve the drain rate of a socket queue, but it is unlikely to be a long-term solution.
To increase the depth of a queue, increase the size of the socket receive buffer by making either of the following changes:
Increase the value of /proc/sys/net/core/rmem_default
This parameter controls the default size of the receive buffer used by sockets. This value must be smaller than or equal to the value of /proc/sys/net/core/rmem_max.
Use setsockopt to configure a larger SO_RCVBUF value
This parameter controls the maximum size in bytes of a socket's receive buffer. Use the getsockopt system call to determine the current value of the buffer. For further information, see the socket(7) manual page.

9.3.5. Configuring Receive-Side Scaling (RSS)

Receive-Side Scaling (RSS), also known as multi-queue receive, distributes network receive processing across several hardware-based receive queues, allowing inbound network traffic to be processed by multiple CPUs. RSS can be used to relieve bottlenecks in receive interrupt processing caused by overloading a single CPU, and to reduce network latency.
To determine whether your network interface card supports RSS, check whether multiple interrupt request queues are associated with the interface in /proc/interrupts. For example, if you are interested in the p1p1 interface:
# egrep 'CPU|p1p1' /proc/interrupts
   CPU0    CPU1    CPU2    CPU3    CPU4    CPU5
89:   40187       0       0       0       0       0   IR-PCI-MSI-edge   p1p1-0
90:       0     790       0       0       0       0   IR-PCI-MSI-edge   p1p1-1
91:       0       0     959       0       0       0   IR-PCI-MSI-edge   p1p1-2
92:       0       0       0    3310       0       0   IR-PCI-MSI-edge   p1p1-3
93:       0       0       0       0     622       0   IR-PCI-MSI-edge   p1p1-4
94:       0       0       0       0       0    2475   IR-PCI-MSI-edge   p1p1-5
The preceding output shows that the NIC driver created 6 receive queues for the p1p1 interface (p1p1-0 through p1p1-5). It also shows how many interrupts were processed by each queue, and which CPU serviced the interrupt. In this case, there are 6 queues because by default, this particular NIC driver creates one queue per CPU, and this system has 6 CPUs. This is a fairly common pattern among NIC drivers.
Alternatively, you can check the output of ls -1 /sys/devices/*/*/device_pci_address/msi_irqs after the network driver is loaded. For example, if you are interested in a device with a PCI address of 0000:01:00.0, you can list the interrupt request queues of that device with the following command:
# ls -1 /sys/devices/*/*/0000:01:00.0/msi_irqs
101
102
103
104
105
106
107
108
109
RSS is enabled by default. The number of queues (or the CPUs that should process network activity) for RSS are configured in the appropriate network device driver. For the bnx2x driver, it is configured in num_queues. For the sfc driver, it is configured in the rss_cpus parameter. Regardless, it is typically configured in /sys/class/net/device/queues/rx-queue/, where device is the name of the network device (such as eth1) and rx-queue is the name of the appropriate receive queue.
When configuring RSS, Red Hat recommends limiting the number of queues to one per physical CPU core. Hyper-threads are often represented as separate cores in analysis tools, but configuring queues for all cores including logical cores such as hyper-threads has not proven beneficial to network performance.
When enabled, RSS distributes network processing equally between available CPUs based on the amount of processing each CPU has queued. However, you can use the ethtool --show-rxfh-indir and --set-rxfh-indir parameters to modify how network activity is distributed, and weight certain types of network activity as more important than others.
The irqbalance daemon can be used in conjunction with RSS to reduce the likelihood of cross-node memory transfers and cache line bouncing. This lowers the latency of processing network packets.

9.3.6. Configuring Receive Packet Steering (RPS)

Receive Packet Steering (RPS) is similar to RSS in that it is used to direct packets to specific CPUs for processing. However, RPS is implemented at the software level, and helps to prevent the hardware queue of a single network interface card from becoming a bottleneck in network traffic.
RPS has several advantages over hardware-based RSS:
  • RPS can be used with any network interface card.
  • It is easy to add software filters to RPS to deal with new protocols.
  • RPS does not increase the hardware interrupt rate of the network device. However, it does introduce inter-processor interrupts.
RPS is configured per network device and receive queue, in the /sys/class/net/device/queues/rx-queue/rps_cpus file, where device is the name of the network device (such as eth0) and rx-queue is the name of the appropriate receive queue (such as rx-0).
The default value of the rps_cpus file is 0. This disables RPS, so the CPU that handles the network interrupt also processes the packet.
To enable RPS, configure the appropriate rps_cpus file with the CPUs that should process packets from the specified network device and receive queue.
The rps_cpus files use comma-delimited CPU bitmaps. Therefore, to allow a CPU to handle interrupts for the receive queue on an interface, set the value of their positions in the bitmap to 1. For example, to handle interrupts with CPUs 0, 1, 2, and 3, set the value of rps_cpus to f, which is the hexadecimal value for 15. In binary representation, 15 is 00001111 (1+2+4+8).
For network devices with single transmit queues, best performance can be achieved by configuring RPS to use CPUs in the same memory domain. On non-NUMA systems, this means that all available CPUs can be used. If the network interrupt rate is extremely high, excluding the CPU that handles network interrupts may also improve performance.
For network devices with multiple queues, there is typically no benefit to configuring both RPS and RSS, as RSS is configured to map a CPU to each receive queue by default. However, RPS may still be beneficial if there are fewer hardware queues than CPUs, and RPS is configured to use CPUs in the same memory domain.

9.3.7. Configuring Receive Flow Steering (RFS)

Receive Flow Steering (RFS) extends RPS behavior to increase the CPU cache hit rate and thereby reduce network latency. Where RPS forwards packets based solely on queue length, RFS uses the RPS back end to calculate the most appropriate CPU, then forwards packets based on the location of the application consuming the packet. This increases CPU cache efficiency.
RFS is disabled by default. To enable RFS, you must edit two files:
/proc/sys/net/core/rps_sock_flow_entries
Set the value of this file to the maximum expected number of concurrently active connections. We recommend a value of 32768 for moderate server loads. All values entered are rounded up to the nearest power of 2 in practice.
/sys/class/net/device/queues/rx-queue/rps_flow_cnt
Replace device with the name of the network device you wish to configure (for example, eth0), and rx-queue with the receive queue you wish to configure (for example, rx-0).
Set the value of this file to the value of rps_sock_flow_entries divided by N, where N is the number of receive queues on a device. For example, if rps_flow_entries is set to 32768 and there are 16 configured receive queues, rps_flow_cnt should be set to 2048. For single-queue devices, the value of rps_flow_cnt is the same as the value of rps_sock_flow_entries.
Data received from a single sender is not sent to more than one CPU. If the amount of data received from a single sender is greater than a single CPU can handle, configure a larger frame size to reduce the number of interrupts and therefore the amount of processing work for the CPU. Alternatively, consider NIC offload options or faster CPUs.
Consider using numactl or taskset in conjunction with RFS to pin applications to specific cores, sockets, or NUMA nodes. This can help prevent packets from being processed out of order.

9.3.8. Configuring Accelerated RFS

Accelerated RFS boosts the speed of RFS by adding hardware assistance. Like RFS, packets are forwarded based on the location of the application consuming the packet. Unlike traditional RFS, however, packets are sent directly to a CPU that is local to the thread consuming the data: either the CPU that is executing the application, or a CPU local to that CPU in the cache hierarchy.
Accelerated RFS is only available if the following conditions are met:
  • Accelerated RFS must be supported by the network interface card. Accelerated RFS is supported by cards that export the ndo_rx_flow_steer() netdevice function.
  • ntuple filtering must be enabled.
Once these conditions are met, CPU to queue mapping is deduced automatically based on traditional RFS configuration. That is, CPU to queue mapping is deduced based on the IRQ affinities configured by the driver for each receive queue. Refer to Section 9.3.7, “Configuring Receive Flow Steering (RFS)” for details on configuring traditional RFS.
Red Hat recommends using accelerated RFS wherever using RFS is appropriate and the network interface card supports hardware acceleration.