Chapter 41. Understanding the eBPF networking features in RHEL 9

The extended Berkeley Packet Filter (eBPF) is an in-kernel virtual machine that allows code execution in the kernel space. This code runs in a restricted sandbox environment with access only to a limited set of functions.

In networking, you can use eBPF to complement or replace kernel packet processing. Depending on the hook you use, eBPF programs can, for example:

  • Read and write packet data and metadata
  • Look up sockets and routes
  • Set socket options
  • Redirect packets

41.1. Overview of networking eBPF features in RHEL 9

You can attach extended Berkeley Packet Filter (eBPF) networking programs to the following hooks in RHEL:

  • eXpress Data Path (XDP): Provides early access to received packets before the kernel networking stack processes them.
  • tc eBPF classifier with direct-action flag: Provides powerful packet processing on ingress and egress.
  • Control Groups version 2 (cgroup v2): Enables filtering and overriding socket-based operations performed by programs in a control group.
  • Socket filtering: Enables filtering of packets received from sockets. This feature was also available in the classic Berkeley Packet Filter (cBPF), but has been extended to support eBPF programs.
  • Stream parser: Enables splitting up streams into individual messages, filtering them, and redirecting them to sockets.
  • SO_REUSEPORT socket selection: Provides a programmable selection of a receiving socket from a reuseport socket group.
  • Flow dissector: Enables overriding the way the kernel parses packet headers in certain situations.
  • TCP congestion control callbacks: Enables implementing a custom TCP congestion control algorithm.
  • Routes with encapsulation: Enables creating custom tunnel encapsulation.

XDP

You can attach programs of the BPF_PROG_TYPE_XDP type to a network interface. The kernel then executes the program on received packets before the kernel network stack starts processing them. This enables fast packet handling in certain situations, such as fast packet dropping to prevent distributed denial of service (DDoS) attacks and fast packet redirects for load balancing scenarios.

You can also use XDP for different forms of packet monitoring and sampling. The kernel allows XDP programs to modify packets and to pass them for further processing to the kernel network stack.
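
The following is a minimal sketch of an XDP program in restricted C, shown for illustration only: it drops all UDP packets and passes everything else to the kernel network stack. The file and function names are placeholders; you would compile such a program with clang for the bpf target.

    /* xdp_drop_udp.c: minimal sketch of a BPF_PROG_TYPE_XDP program */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/in.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int xdp_drop_udp(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* Bounds checks are mandatory; the verifier rejects the program without them. */
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_PASS;

        /* Drop all UDP traffic before the kernel network stack sees it. */
        if (ip->protocol == IPPROTO_UDP)
            return XDP_DROP;

        return XDP_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";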

The following XDP modes are available:

  • Native (driver) XDP: The kernel executes the program at the earliest possible point during packet reception. At this point, the kernel has not yet parsed the packet and, therefore, no kernel-provided metadata is available. This mode requires that the network interface driver supports XDP, but not all drivers support this native mode.
  • Generic XDP: The kernel network stack executes the XDP program early in the processing. At that time, kernel data structures have been allocated and the packet has been pre-processed. Dropping or redirecting a packet therefore incurs significant overhead compared to the native mode. However, the generic mode does not require network interface driver support and works with all network interfaces.
  • Offloaded XDP: The kernel executes the XDP program on the network interface instead of on the host CPU. Note that this requires specific hardware, and only certain eBPF features are available in this mode.

On RHEL, load all XDP programs using the libxdp library. This library enables system-controlled usage of XDP.

Note

Currently, there are some system configuration limitations for XDP programs. For example, you must disable certain hardware offload features on the receiving interface. Additionally, not all features are available with all drivers that support the native mode.

In RHEL 9, Red Hat supports the XDP features only if you use the libxdp library to load the program into the kernel.
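
For illustration, a minimal user-space loader based on the libxdp API might look like the following sketch. The object file name and interface name are placeholders, and error handling is reduced to the bare minimum.

    /* load_xdp.c: minimal sketch of loading an XDP object file with libxdp */
    #include <net/if.h>
    #include <stdio.h>
    #include <xdp/libxdp.h>

    int main(void)
    {
        int ifindex = if_nametoindex("eth0");   /* placeholder interface name */
        if (!ifindex)
            return 1;

        /* Open the compiled eBPF object file (placeholder file name). */
        struct xdp_program *prog = xdp_program__open_file("xdp_drop_udp.o", NULL, NULL);
        if (libxdp_get_error(prog))
            return 1;

        /* Attach in native (driver) mode; XDP_MODE_SKB selects the generic mode. */
        if (xdp_program__attach(prog, ifindex, XDP_MODE_NATIVE, 0)) {
            fprintf(stderr, "attaching the XDP program failed\n");
            xdp_program__close(prog);
            return 1;
        }

        return 0;
    }

Link such a loader against the libxdp and libbpf libraries.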

AF_XDP

You can use one or more sockets from the AF_XDP protocol family to quickly copy packets from the kernel to user space. An XDP program filters the received packets and redirects them to a given AF_XDP socket.

Traffic Control

The Traffic Control (tc) subsystem offers the following types of eBPF programs:

  • BPF_PROG_TYPE_SCHED_CLS
  • BPF_PROG_TYPE_SCHED_ACT

These types enable you to write custom tc classifiers and tc actions in eBPF. Together with other parts of the tc ecosystem, this enables powerful packet processing and is a core component of several container networking orchestration solutions.

In most cases, only the classifier is used: with the direct-action flag, the eBPF classifier can execute actions directly from the same eBPF program. The clsact Queueing Discipline (qdisc) has been designed to enable this on the ingress side.
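
For illustration, a minimal direct-action classifier might look like the following sketch; it accepts every packet and only marks where real processing would go. The file and function names are placeholders.

    /* tc_pass.c: minimal sketch of a direct-action tc eBPF classifier */
    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    SEC("tc")
    int cls_pass_all(struct __sk_buff *skb)
    {
        /* With the direct-action flag, the return value is the tc action:
         * TC_ACT_OK accepts the packet, TC_ACT_SHOT would drop it. */
        return TC_ACT_OK;
    }

    char LICENSE[] SEC("license") = "GPL";

You can attach such a classifier to a clsact qdisc, for example with tc qdisc add dev eth0 clsact followed by tc filter add dev eth0 ingress bpf direct-action obj tc_pass.o sec tc (the interface and file names are placeholders).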

Note that using a flow dissector eBPF program can influence the operation of some other qdiscs and tc classifiers, such as flower.

Socket filter

Several utilities use or have used the classic Berkeley Packet Filter (cBPF) for filtering packets received on a socket. For example, the tcpdump utility enables the user to specify expressions, which tcpdump then translates into cBPF code.

As an alternative to cBPF, the kernel allows eBPF programs of the BPF_PROG_TYPE_SOCKET_FILTER type for the same purpose.
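
As an illustration, a minimal eBPF socket filter might look like the following sketch. The return value is the number of bytes of the packet to keep: returning 0 discards the packet for that socket, while returning the packet length keeps it unmodified. Names are placeholders.

    /* sock_filter.c: minimal sketch of a BPF_PROG_TYPE_SOCKET_FILTER program */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("socket")
    int sock_keep_all(struct __sk_buff *skb)
    {
        /* 0 would drop the packet; skb->len keeps the whole packet. */
        return skb->len;
    }

    char LICENSE[] SEC("license") = "GPL";

A user-space application loads such a program and attaches it to a socket with the SO_ATTACH_BPF socket option.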

Control Groups

In RHEL, you can attach multiple types of eBPF programs to a cgroup. The kernel executes these programs when a program in the given cgroup performs an operation. Note that you can use only cgroups version 2.

The following networking-related cgroup eBPF programs are available in RHEL:

  • BPF_PROG_TYPE_SOCK_OPS: The kernel calls this program on various TCP events. The program can adjust the behavior of the kernel TCP stack, for example, by setting custom TCP header options.
  • BPF_PROG_TYPE_CGROUP_SOCK_ADDR: The kernel calls this program during connect, bind, sendto, recvmsg, getpeername, and getsockname operations. This program allows changing IP addresses and ports. This is useful when you implement socket-based network address translation (NAT) in eBPF.
  • BPF_PROG_TYPE_CGROUP_SOCKOPT: The kernel calls this program during setsockopt and getsockopt operations and allows changing the options.
  • BPF_PROG_TYPE_CGROUP_SOCK: The kernel calls this program during socket creation, socket releasing, and binding to addresses. You can use these programs to allow or deny the operation, or only to inspect socket creation for statistics.
  • BPF_PROG_TYPE_CGROUP_SKB: This program filters individual packets on ingress and egress, and can accept or reject packets.
  • BPF_PROG_TYPE_CGROUP_SYSCTL: This program allows filtering of access to system controls (sysctl).
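
For example, a minimal BPF_PROG_TYPE_CGROUP_SKB program that accepts all ingress traffic of a cgroup might look like the following sketch; returning 0 instead would reject the packet. Names are placeholders.

    /* cgroup_skb.c: minimal sketch of a cgroup ingress filter */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("cgroup_skb/ingress")
    int cgroup_allow_all(struct __sk_buff *skb)
    {
        /* 1 accepts the packet, 0 rejects it. */
        return 1;
    }

    char LICENSE[] SEC("license") = "GPL";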

Stream Parser

A stream parser operates on a group of sockets that are added to a special eBPF map. The eBPF program then processes packets that the kernel receives or sends on those sockets.

The following stream parser eBPF programs are available in RHEL:

  • BPF_PROG_TYPE_SK_SKB: An eBPF program parses packets received from the socket into individual messages, and instructs the kernel to drop those messages or send them to another socket in the group.
  • BPF_PROG_TYPE_SK_MSG: This program filters egress messages. An eBPF program parses the packets into individual messages and either approves or rejects them.
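
For illustration, a minimal stream verdict program of the BPF_PROG_TYPE_SK_SKB type might look like the following sketch; it passes every message and only notes where a redirect would go. The map and function names are placeholders.

    /* sk_skb_verdict.c: minimal sketch of a stream parser verdict program */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* User space adds the participating sockets to this map. */
    struct {
        __uint(type, BPF_MAP_TYPE_SOCKMAP);
        __uint(max_entries, 16);
        __type(key, __u32);
        __type(value, __u64);
    } sock_map SEC(".maps");

    SEC("sk_skb/stream_verdict")
    int stream_verdict(struct __sk_buff *skb)
    {
        /* SK_PASS delivers the message; a real program could instead call
         * bpf_sk_redirect_map(skb, &sock_map, key, 0) to send it to another
         * socket in the group, or return SK_DROP. */
        return SK_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";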

SO_REUSEPORT socket selection

Using this socket option, you can bind multiple sockets to the same IP address and port. Without eBPF, the kernel selects the receiving socket based on a connection hash. With the BPF_PROG_TYPE_SK_REUSEPORT program, the selection of the receiving socket is fully programmable.
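
A BPF_PROG_TYPE_SK_REUSEPORT program selects the target socket by calling the bpf_sk_select_reuseport() helper with a map that holds the sockets of the group. The following minimal sketch, with placeholder names, always selects the socket stored at index 0.

    /* reuseport_select.c: minimal sketch of a SO_REUSEPORT socket selector */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* User space stores the sockets of the reuseport group in this map. */
    struct {
        __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
        __uint(max_entries, 8);
        __type(key, __u32);
        __type(value, __u64);
    } reuseport_map SEC(".maps");

    SEC("sk_reuseport")
    int select_socket(struct sk_reuseport_md *md)
    {
        __u32 index = 0;   /* a real program would compute this index */

        /* If the selection fails, SK_PASS falls back to the kernel's default
         * hash-based selection; SK_DROP would discard the packet instead. */
        bpf_sk_select_reuseport(md, &reuseport_map, &index, 0);
        return SK_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";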

Flow dissector

When the kernel needs to process packet headers without going through the full protocol decode, it dissects them. For example, this happens in the tc subsystem, in multipath routing, in bonding, or when calculating a packet hash. In this situation, the kernel parses the packet headers and fills internal structures with the information from them. You can replace this internal parsing with a BPF_PROG_TYPE_FLOW_DISSECTOR program. Note that, in RHEL, you can dissect only TCP and UDP over IPv4 and IPv6 in eBPF.

TCP Congestion Control

You can write a custom TCP congestion control algorithm using a group of BPF_PROG_TYPE_STRUCT_OPS programs that implement struct tcp_congestion_ops callbacks. An algorithm implemented this way is available to the system alongside the built-in kernel algorithms.

Routes with encapsulation

You can attach one of the following eBPF program types to routes in the routing table as a tunnel encapsulation attribute:

  • BPF_PROG_TYPE_LWT_IN
  • BPF_PROG_TYPE_LWT_OUT
  • BPF_PROG_TYPE_LWT_XMIT

The functionality of such an eBPF program is limited to specific tunnel configurations and does not allow creating a generic encapsulation or decapsulation solution.
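
As an illustration, a minimal program of the BPF_PROG_TYPE_LWT_XMIT type might look like the following sketch; it forwards every packet unchanged and only marks where a real program would push the tunnel header. Names are placeholders.

    /* lwt_xmit.c: minimal sketch of a lightweight tunnel eBPF program */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("lwt_xmit")
    int lwt_xmit_noop(struct __sk_buff *skb)
    {
        /* A real program would build and push the encapsulation header here,
         * for example with bpf_lwt_push_encap(). BPF_OK continues normal
         * processing; BPF_DROP would discard the packet. */
        return BPF_OK;
    }

    char LICENSE[] SEC("license") = "GPL";

You then reference the object file from a route, for example with ip route add 192.0.2.0/24 encap bpf xmit obj lwt_xmit.o section lwt_xmit dev eth0 (the prefix, file, and interface names are placeholders).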

Socket lookup

To bypass limitations of the bind system call, use an eBPF program of the BPF_PROG_TYPE_SK_LOOKUP type. Such programs can select a listening socket for new incoming TCP connections or an unconnected socket for UDP packets.
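
A minimal BPF_PROG_TYPE_SK_LOOKUP sketch is shown below; it leaves the kernel's normal lookup in place for every packet and only indicates where a custom socket selection with bpf_sk_assign() would go. Names are placeholders.

    /* sk_lookup.c: minimal sketch of a socket lookup program */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("sk_lookup")
    int select_listener(struct bpf_sk_lookup *ctx)
    {
        /* A real program would fetch a socket from a map and hand it to the
         * kernel with bpf_sk_assign(ctx, sk, 0). Returning SK_PASS without an
         * assignment lets the kernel's normal lookup proceed; SK_DROP would
         * refuse the connection or datagram. */
        return SK_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";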

41.2. Overview of XDP features in RHEL 9 by network cards

The following is an overview of XDP-enabled network cards and the XDP features you can use with them:

Network card | Driver | Basic | Redirect | Target | HW offload | Zero-copy | Large MTU
Amazon Elastic Network Adapter | ena | yes | yes | yes [a] | no | no | no
aQuantia AQtion Ethernet card | atlantic | yes | yes | no | no | no | no
Broadcom NetXtreme-C/E 10/25/40/50 gigabit Ethernet | bnxt_en | yes | yes | yes [a] | no | no | yes
Cavium Thunder Virtual function | nicvf | yes | no | no | no | no | no
Google Virtual NIC (gVNIC) support | gve | yes | yes | yes | no | yes | no
Intel® 10GbE PCI Express Virtual Function Ethernet | ixgbevf | yes | no | no | no | no | no
Intel® 10GbE PCI Express adapters | ixgbe | yes | yes | yes [a] | no | yes | yes [b]
Intel® Ethernet Connection E800 Series | ice | yes | yes | yes [a] | no | yes | yes
Intel® Ethernet Controller I225-LM/I225-V family | igc | yes | yes | yes | no | yes | yes [b]
Intel® PCI Express Gigabit adapters | igb | yes | yes | yes [a] | no | no | yes [b]
Intel® Ethernet Controller XL710 Family | i40e | yes | yes | yes [a] [c] | no | yes | no
Marvell OcteonTX2 | rvu_nicpf | yes | yes | yes [a] [c] | no | no | no
Mellanox 5th generation network adapters (ConnectX series) | mlx5_core | yes | yes | yes [c] | no | yes | yes
Mellanox Technologies 1/10/40Gbit Ethernet | mlx4_en | yes | yes | no | no | no | no
Microsoft Azure Network Adapter | mana | yes | yes | yes | no | no | no
Microsoft Hyper-V virtual network | hv_netvsc | yes | yes | yes | no | no | no
Netronome® NFP4000/NFP6000 NIC [d] | nfp | yes | no | no | yes | yes | no
QEMU Virtio network | virtio_net | yes | yes | yes [a] | no | no | yes
QLogic QED 25/40/100Gb Ethernet NIC | qede | yes | yes | yes | no | no | no
STMicroelectronics Multi-Gigabit Ethernet | stmmac | yes | yes | yes | no | yes | no
Solarflare SFC9000/SFC9100/EF100-family | sfc | yes | yes | yes [c] | no | no | no
Universal TUN/TAP device | tun | yes | yes | yes | no | no | no
Virtual Ethernet pair device | veth | yes | yes | yes | no | no | yes
VMware VMXNET3 ethernet driver | vmxnet3 | yes | yes | yes [a] [c] | no | no | no
Xen paravirtual network device | xen-netfront | yes | yes | yes | no | no | no

[a] Only if an XDP program is loaded on the interface.
[b] Transmitting side only. Cannot receive large packets through XDP.
[c] Requires the number of allocated XDP TX queues to be greater than or equal to the largest CPU index.
[d] Some of the listed features are not available for the Netronome® NFP3800 NIC.

Legend:

  • Basic: Supports basic return codes: DROP, PASS, ABORTED, and TX.
  • Redirect: Supports the XDP_REDIRECT return code.
  • Target: Can be a target of the XDP_REDIRECT return code.
  • HW offload: Supports XDP hardware offload.
  • Zero-copy: Supports the zero-copy mode for the AF_XDP protocol family.
  • Large MTU: Supports packets larger than page size.