Chapter 31. Tuning the network performance

Tuning the network settings is a complex process with many factors to consider, such as the CPU-to-memory architecture, the number of CPU cores, and more. Red Hat Enterprise Linux uses default settings that are optimized for most scenarios. However, in certain cases, it can be necessary to tune network settings to increase throughput, reduce latency, or solve problems, such as packet drops.

31.1. Configuring an operating system to optimize access to network resources

You can configure the operating system to optimize access to network resources across its workloads. Network performance problems are sometimes the result of hardware malfunction or faulty infrastructure. Resolving these issues is beyond the scope of this document.

The TuneD service provides several profiles to improve performance in a number of specific use cases, and you can switch between them as shown in the example after this list:

  • latency-performance
  • network-latency
  • network-throughput
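
You can activate one of these profiles with the tuned-adm utility. This is a minimal sketch; the network-latency profile is used only as an illustration:

# tuned-adm profile network-latency
# tuned-adm active
Current active profile: network-latency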

31.1.1. Tools for monitoring and diagnosing performance issues

Red Hat Enterprise Linux 9 provides the following tools for monitoring system performance and diagnosing performance problems related to the networking subsystem:

  • ss utility prints statistical information about sockets, which enables administrators to assess device performance over time. By default, ss displays open non-listening sockets that have established connections. Using command-line options, administrators can filter out statistics about specific sockets. Red Hat recommends ss over the deprecated netstat utility in Red Hat Enterprise Linux.
  • ip utility lets administrators manage and monitor routes, devices, routing policies, and tunnels. The ip monitor command can continuously monitor the state of devices, addresses, and routes. Use the -j option to display the output in JSON format, which can be further provided to other utilities to automate information processing.
  • dropwatch is an interactive tool, provided by the dropwatch package. It monitors and records packets that are dropped by the kernel.
  • ethtool utility enables administrators to view and edit network interface card settings. Use this tool to observe the statistics of certain devices, such as the number of packets dropped by that device. Using the ethtool -S device-name command, view the counters of the device that you want to monitor.
  • /proc/net/snmp file displays data that the snmp agent uses for IP, ICMP, TCP and UDP monitoring and management. Examining this file on a regular basis helps administrators to identify unusual values and thereby identify potential performance problems. For example, an increase in UDP input errors (InErrors) in the /proc/net/snmp file can indicate a bottleneck in a socket receive queue.
  • nstat tool monitors kernel SNMP and network interface statistics. This tool reads data from the /proc/net/snmp file and prints the information in a human readable format.
  • By default, the SystemTap scripts, provided by the systemtap-client package, are installed in the /usr/share/systemtap/examples/network directory:

    • nettop.stp: Every 5 seconds, the script displays a list of processes (process identifier and command) with the number of packets sent and received and the amount of data sent and received by the process during that interval.
    • socket-trace.stp: Instruments each of the functions in the Linux kernel’s net/socket.c file, and displays trace data.
    • dropwatch.stp: Every 5 seconds, the script displays the number of socket buffers freed at locations in the kernel. Use the --all-modules option to see symbolic names.
    • latencytap.stp: This script records the effect that different types of latency have on one or more processes. It prints a list of latency types every 30 seconds, sorted in descending order by the total time the process or processes spent waiting. This can be useful for identifying the cause of both storage and network latency.

    Red Hat recommends using the --all-modules option with this script to better enable the mapping of latency events. By default, this script is installed in the /usr/share/systemtap/examples/profiling directory.

  • BPF Compiler Collection (BCC) is a library that facilitates the creation of extended Berkeley Packet Filter (eBPF) programs. The main utility of eBPF programs is analyzing operating system and network performance without incurring significant overhead or security issues.
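
For example, the following commands show a few of these tools in action. The interface name enp1s0 is only an example; replace it with an interface on your system:

# ss -t -i
# ip -j -s link show enp1s0
# ethtool -S enp1s0
# nstat -az

The ss command prints TCP sockets with internal TCP information, ip prints device state and statistics in JSON format, ethtool -S prints the NIC driver counters, and nstat -az prints absolute values of the kernel SNMP and network interface counters, including counters that are zero.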

31.1.2. Bottlenecks in a packet reception

While the network stack is largely self-optimizing, there are a number of points during network packet processing that can become bottlenecks and reduce performance.

The following issues can cause bottlenecks:

The buffer or ring buffer of the network card
The hardware buffer can be a bottleneck if the kernel drops a large number of packets. Use the ethtool utility to monitor a system for dropped packets.
The hardware or software interrupt queues
Interrupts can increase latency and processor contention. For information on how the processor handles interrupts, see Overview of an interrupt request, Balancing interrupts manually, and Setting the smp_affinity mask.
The socket receive queue of the application
A bottleneck in an application’s receive queue is indicated by a large number of packets that are not copied to the requesting application, or by an increase in UDP input errors (InErrors) in the /proc/net/snmp file.
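
For example, you can check for pressure on an application’s socket receive queue by watching the UDP error counters and the Recv-Q column of the application’s sockets. This is a minimal sketch; the counter values depend entirely on the system:

# nstat -az UdpInErrors UdpRcvbufErrors
# ss -nump

Growing UdpInErrors or UdpRcvbufErrors values, or a persistently large Recv-Q, indicate that the application is not draining its socket queue fast enough.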

If the hardware buffer drops a large number of packets, the following are a few potential solutions:

Slow the input traffic
Filter the incoming traffic, reduce the number of joined multicast groups, or reduce the amount of broadcast traffic to decrease the rate at which the queue fills.
Resize the hardware buffer queue

Reduce the number of packets being dropped by increasing the size of the queue so that it does not overflow as easily. You can modify the rx/tx parameters of the network device with the ethtool command:

ethtool --set-ring device-name rx value
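
For example, assuming an interface named enp1s0 (an example name), you can display the maximum and current ring sizes and then increase the receive ring. The sizes shown below are illustrative and driver-dependent:

# ethtool --show-ring enp1s0
Ring parameters for enp1s0:
Pre-set maximums:
RX:             4096
TX:             4096
Current hardware settings:
RX:             256
TX:             256

# ethtool --set-ring enp1s0 rx 4096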

Change the drain rate of the queue
  • Decrease the rate at which the queue fills by filtering or dropping packets before they reach the queue, or by lowering the device weight of the network interface card.

    The device weight refers to the number of packets a device can receive at one time in a single scheduled processor access. You can increase the rate at which a queue is drained by increasing its device weight, which is controlled by the dev_weight kernel setting. To temporarily alter this parameter, change the contents of the /proc/sys/net/core/dev_weight file, or, to alter it persistently, use the sysctl command, which is provided by the procps-ng package. See the example after this list.

  • Increase the length of the application’s socket queue: This is typically the easiest way to improve the drain rate of a socket queue, but it is unlikely to be a long-term solution. If a socket queue receives a limited amount of traffic in bursts, increasing the depth of the socket queue to match the size of the bursts of traffic may prevent packets from being dropped. To increase the depth of a queue, increase the size of the socket receive buffer by making either of the following changes:

    • Increase the value of the /proc/sys/net/core/rmem_default parameter: This parameter controls the default size of the receive buffer used by sockets. This value must be smaller than or equal to the value of the /proc/sys/net/core/rmem_max parameter.
    • Use the setsockopt system call to configure a larger SO_RCVBUF value: This option controls the maximum size, in bytes, of a socket’s receive buffer. Use the getsockopt system call to determine the current value of the buffer.
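
A minimal sketch of both approaches follows. The values and the drop-in file name 95-net-queue-drain.conf are examples only; keep net.core.rmem_default less than or equal to net.core.rmem_max:

# sysctl net.core.dev_weight net.core.rmem_default net.core.rmem_max
net.core.dev_weight = 64
net.core.rmem_default = 212992
net.core.rmem_max = 212992

# echo "net.core.dev_weight=128" > /etc/sysctl.d/95-net-queue-drain.conf
# echo "net.core.rmem_default=425984" >> /etc/sysctl.d/95-net-queue-drain.conf
# sysctl -p /etc/sysctl.d/95-net-queue-drain.conf

Applications can still request a specific receive buffer size with the SO_RCVBUF socket option, up to the net.core.rmem_max limit.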

Altering the drain rate of a queue is usually the simplest way to mitigate poor network performance. However, increasing the number of packets that a device can receive at one time uses additional processor time, during which no other processes can be scheduled, so this can cause other performance problems.

Additional resources

  • ss(8), socket(7), and ethtool(8) man pages
  • /proc/net/snmp file

31.1.3. Busy polling

If analysis reveals high latency, your system may benefit from poll-based rather than interrupt-based packet receipt.

Busy polling helps to reduce latency in the network receive path by allowing socket layer code to poll the receive queue of a network device, and disables network interrupts. This removes delays caused by the interrupt and the resultant context switch. However, it also increases CPU utilization. Busy polling also prevents the CPU from sleeping, which can incur additional power consumption. Busy polling requires support in the network device driver.

31.1.3.1. Enabling busy polling

By default, busy polling is disabled. This procedure describes how to enable it.

Procedure

  1. Ensure that the CONFIG_NET_RX_BUSY_POLL compilation option is enabled:

    # cat /boot/config-$(uname -r) | grep CONFIG_NET_RX_BUSY_POLL
    CONFIG_NET_RX_BUSY_POLL=y
  2. Enable busy polling

    1. To enable busy polling on specific sockets, set the net.core.busy_poll kernel parameter to a value other than 0:

      # echo "net.core.busy_poll=50" > /etc/sysctl.d/95-enable-busy-polling-for-sockets.conf
      # sysctl -p /etc/sysctl.d/95-enable-busy-polling-for-sockets.conf

      This parameter controls the number of microseconds to wait for packets on the socket poll and select syscalls. Red Hat recommends a value of 50.

    2. Add the SO_BUSY_POLL socket option to the socket by using the setsockopt() system call in the application.
    3. To enable busy polling globally, set the net.core.busy_read kernel parameter to a value other than 0:

      # echo "net.core.busy_read=50" > /etc/sysctl.d/95-enable-busy-polling-globally.conf
      # sysctl -p /etc/sysctl.d/95-enable-busy-polling-globally.conf

      The net.core.busy_read parameter controls the number of microseconds to wait for packets on the device queue for socket reads. It also sets the default value of the SO_BUSY_POLL option. Red Hat recommends a value of 50 for a small number of sockets, and a value of 100 for large numbers of sockets. For extremely large numbers of sockets, for example more than several hundred, use the epoll system call instead.
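
      To confirm which values are active, you can query both parameters with the sysctl utility. The output below assumes the values recommended in this procedure:

      # sysctl net.core.busy_poll net.core.busy_read
      net.core.busy_poll = 50
      net.core.busy_read = 50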

31.1.4. Receive-Side Scaling

Receive-Side Scaling (RSS), also known as multi-queue receive, distributes network receive processing across several hardware-based receive queues, allowing inbound network traffic to be processed by multiple CPUs. RSS can be used to relieve bottlenecks in receive interrupt processing caused by overloading a single CPU, and to reduce network latency. By default, RSS is enabled.

The number of queues or the CPUs that should process network activity for RSS are configured in the appropriate network device driver:

  • For the bnx2x driver, it is configured in the num_queues parameter.
  • For the sfc driver, it is configured in the rss_cpus parameter.

Regardless, it is typically configured in the /sys/class/net/device/queues/rx-queue/ directory, where device is the name of the network device (such as enp1s0) and rx-queue is the name of the appropriate receive queue.
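
For example, assuming an interface named enp1s0 (an example name), you can list the receive queues that the driver created and view or change the number of hardware channels with the ethtool utility. The counts shown are illustrative, and not every driver supports changing them:

# ls /sys/class/net/enp1s0/queues/
rx-0  rx-1  rx-2  rx-3  tx-0  tx-1  tx-2  tx-3

# ethtool --show-channels enp1s0
Channel parameters for enp1s0:
Pre-set maximums:
Combined:       8
Current hardware settings:
Combined:       4

# ethtool --set-channels enp1s0 combined 8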

The irqbalance daemon can be used in conjunction with RSS to reduce the likelihood of cross-node memory transfers and cache line bouncing. This lowers the latency of processing network packets.

31.1.4.1. Viewing the interrupt request queues

When configuring Receive-Side Scaling (RSS), Red Hat recommends limiting the number of queues to one per physical CPU core. Hyper-threads are often represented as separate cores in analysis tools, but configuring queues for all cores, including logical cores such as hyper-threads, has not proven beneficial to network performance.

When enabled, RSS distributes network processing equally between the available CPUs, based on the amount of processing each CPU has queued. However, you can use the --show-rxfh-indir and --set-rxfh-indir parameters of the ethtool utility to modify how RHEL distributes network activity, and to weigh certain types of network activity as more important than others.
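
For example, the following commands print and then rewrite the receive flow hash indirection table for a device named enp1s0 (an example name). The table contents shown are illustrative:

# ethtool --show-rxfh-indir enp1s0
RX flow hash indirection table for enp1s0 with 6 RX ring(s):
    0:      0     1     2     3     4     5     0     1
    8:      2     3     4     5     0     1     2     3
...

# ethtool --set-rxfh-indir enp1s0 equal 6

The second command spreads received flows equally across the first 6 receive queues.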

This procedure describes how to view the interrupt request queues.

Procedure

  • To determine whether your network interface card supports RSS, check whether multiple interrupt request queues are associated with the interface in /proc/interrupts:

    # egrep 'CPU|p1p1' /proc/interrupts
     CPU0    CPU1    CPU2    CPU3    CPU4    CPU5
    89:   40187       0       0       0       0       0   IR-PCI-MSI-edge   p1p1-0
    90:       0     790       0       0       0       0   IR-PCI-MSI-edge   p1p1-1
    91:       0       0     959       0       0       0   IR-PCI-MSI-edge   p1p1-2
    92:       0       0       0    3310       0       0   IR-PCI-MSI-edge   p1p1-3
    93:       0       0       0       0     622       0   IR-PCI-MSI-edge   p1p1-4
    94:       0       0       0       0       0    2475   IR-PCI-MSI-edge   p1p1-5

    The output shows that the NIC driver created 6 receive queues for the p1p1 interface (p1p1-0 through p1p1-5). It also shows how many interrupts were processed by each queue, and which CPU serviced the interrupt. In this case, there are 6 queues because by default, this particular NIC driver creates one queue per CPU, and this system has 6 CPUs. This is a fairly common pattern among NIC drivers.

  • To list the interrupt request queue for a PCI device with the address 0000:01:00.0:

    # ls -1 /sys/devices/*/*/0000:01:00.0/msi_irqs
    101
    102
    103
    104
    105
    106
    107
    108
    109

31.1.5. Receive Packet Steering

Receive Packet Steering (RPS) is similar to Receive-Side Scaling (RSS) in that it is used to direct packets to specific CPUs for processing. However, RPS is implemented at the software level, and helps to prevent the hardware queue of a single network interface card from becoming a bottleneck in network traffic. By default, RPS is disabled.

RPS has several advantages over hardware-based RSS:

  • RPS can be used with any network interface card.
  • It is easy to add software filters to RPS to deal with new protocols.
  • RPS does not increase the hardware interrupt rate of the network device. However, it does introduce inter-processor interrupts.

RPS is configured per network device and receive queue, in the /sys/class/net/device/queues/rx-queue/rps_cpus file, where device is the name of the network device, such as enp1s0 and rx-queue is the name of the appropriate receive queue, such as rx-0.

The default value of the rps_cpus file is 0. This disables RPS, so the CPU that handles the network interrupt also processes the packet. To enable RPS, configure the appropriate rps_cpus file with the CPUs that should process packets from the specified network device and receive queue.

The rps_cpus files use comma-delimited CPU bitmaps. Therefore, to allow a CPU to handle interrupts for the receive queue on an interface, set the value of its position in the bitmap to 1. For example, to handle interrupts with CPUs 0, 1, 2, and 3, set the value of rps_cpus to f, which is the hexadecimal value for 15. In binary representation, 15 is 1111 (1+2+4+8), so one bit is set for each of the four CPUs.
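
A minimal sketch, assuming a device named enp1s0 and its first receive queue rx-0 (example names):

# echo f > /sys/class/net/enp1s0/queues/rx-0/rps_cpus
# cat /sys/class/net/enp1s0/queues/rx-0/rps_cpus
0f

The width of the printed bitmap depends on the number of CPUs in the system, and the setting does not persist across reboots unless you apply it from a udev rule or a startup script.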

For network devices with single transmit queues, best performance can be achieved by configuring RPS to use CPUs in the same memory domain. On non-NUMA systems, this means that all available CPUs can be used. If the network interrupt rate is extremely high, excluding the CPU that handles network interrupts may also improve performance.

For network devices with multiple queues, there is typically no benefit to configure both RPS and RSS, as RSS is configured to map a CPU to each receive queue by default. However, RPS can still be beneficial if there are fewer hardware queues than CPUs, and RPS is configured to use CPUs in the same memory domain.

31.1.6. Receive Flow Steering

Receive Flow Steering (RFS) extends Receive Packet Steering (RPS) behavior to increase the CPU cache hit rate and thereby reduce network latency. Where RPS forwards packets based solely on queue length, RFS uses the RPS back end to calculate the most appropriate CPU, then forwards packets based on the location of the application consuming the packet. This increases CPU cache efficiency.

Data received from a single sender is not sent to more than one CPU. If the amount of data received from a single sender is greater than a single CPU can handle, configure a larger frame size to reduce the number of interrupts and therefore the amount of processing work for the CPU. Alternatively, consider NIC offload options or faster CPUs.

Consider using numactl or taskset in conjunction with RFS to pin applications to specific cores, sockets, or NUMA nodes. This can help prevent packets from being processed out of order.
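
For example, a receiver application (my_app is a placeholder name) can be pinned to the CPUs or NUMA node that also handle its network queues. Both commands below are illustrative:

# taskset -c 0-3 ./my_app
# numactl --cpunodebind=0 --membind=0 ./my_app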

31.1.6.1. Enabling Receive Flow Steering

By default, Receive Flow Steering (RFS) is disabled. This procedure describes how to enable RFS.

Procedure

  1. Set the value of the net.core.rps_sock_flow_entries kernel parameter to the maximum expected number of concurrently active connections:

    # echo "net.core.rps_sock_flow_entries=32768" > /etc/sysctl.d/95-enable-rps.conf
    Note

    Red Hat recommends a value of 32768 for moderate server loads. All values entered are rounded up to the nearest power of 2 in practice.

  2. Persistently set the value of the net.core.rps_sock_flow_entries:

    # sysctl -p /etc/sysctl.d/95-enable-rps.conf
  3. Temporarily set the value of the /sys/class/net/device/queues/rx-queue/rps_flow_cnt file to rps_sock_flow_entries/N, where N is the number of receive queues on the device:

    # echo 2048 > /sys/class/net/device/queues/rx-queue/rps_flow_cnt

    Replace device with the name of the network device you wish to configure (for example, enp1s0), and rx-queue with the receive queue you wish to configure (for example, rx-0).

    Replace N with the number of configured receive queues. For example, if rps_sock_flow_entries is set to 32768 and there are 16 configured receive queues, then rps_flow_cnt = 32768/16 = 2048 (that is, rps_flow_cnt = rps_sock_flow_entries/N).

    For single-queue devices, the value of rps_flow_cnt is the same as the value of rps_sock_flow_entries.

  4. To persistently enable RFS on all network devices, create the /etc/udev/rules.d/99-persistent-net.rules file, and add the following content:

    SUBSYSTEM=="net", ACTION=="add", RUN{program}+="/bin/bash -c 'for x in /sys/$DEVPATH/queues/rx-*; do echo 2048 > $x/rps_flow_cnt;  done'"
  5. Optional: To persistently enable RFS on a specific network device:

    SUBSYSTEM=="net", ACTION=="move", NAME="device name" RUN{program}+="/bin/bash -c 'for x in /sys/$DEVPATH/queues/rx-*; do echo 2048 > $x/rps_flow_cnt; done'"

    Replace device name with the actual network device name.

Verification steps

  • Verify that RFS is enabled:

    # cat /proc/sys/net/core/rps_sock_flow_entries
    32768
    
    # cat /sys/class/net/device/queues/rx-queue/rps_flow_cnt
    2048

Additional resources

  • sysctl(8) man page

31.1.7. Accelerated RFS

Accelerated RFS boosts the speed of Receive Flow Steering (RFS) by adding hardware assistance. Like RFS, packets are forwarded based on the location of the application consuming the packet.

Unlike traditional RFS, however, packets are sent directly to a CPU that is local to the thread consuming the data:

  • either the CPU that is executing the application
  • or a CPU local to that CPU in the cache hierarchy

Accelerated RFS is only available if the following conditions are met:

  • The NIC must support accelerated RFS. Accelerated RFS is supported by cards that export the ndo_rx_flow_steer() net_device function. Check the NIC’s data sheet to ensure that this feature is supported.
  • ntuple filtering must be enabled. For information on how to enable these filters, see Enabling the ntuple filters.

Once these conditions are met, CPU to queue mapping is deduced automatically based on traditional RFS configuration. That is, CPU to queue mapping is deduced based on the IRQ affinities configured by the driver for each receive queue. For more information on enabling the traditional RFS, see Enabling Receive Flow Steering.

31.1.7.1. Enabling the ntuple filters

To use accelerated RFS, ntuple filtering must be enabled. Use the ethtool -K command to enable the ntuple filters.

Procedure

  1. Display the current status of the ntuple filter:

    # ethtool -k enp1s0 | grep ntuple-filters
    
    ntuple-filters: off
  2. Enable the ntuple filters:

    # ethtool -K enp1s0 ntuple on
Note

If the output is ntuple-filters: off [fixed], the device does not allow changing the ntuple filtering setting and you cannot enable it:

# ethtool -k enp1s0 | grep ntuple-filters
ntuple-filters: off [fixed]

Verification steps

  • Verify that the ntuple filters are enabled:

    # ethtool -k enp1s0 | grep ntuple-filters
    ntuple-filters: on

Additional resources

  • ethtool(8) man page