A detailed view of the vhost user protocol and its implementation in OVS DPDK, qemu and virtio-net

Solution Verified - Updated -

Environment

Red Hat OpenStack Platform 10
Open vSwitch 2.6.1

Issue

A detailed view of the vhost user protocol and its implementation in OVS DPDK, qemu and virtio-net

Resolution

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

A detailed view of the vhost user protocol and its implementation in OVS DPDK, qemu and virtio-net

Overview: How OVS DPDK and qemu communicate via the vhost user protocol

The vhost user protocol consists of a control path and a data path.

  • All control information is exchanged via a Unix socket. This includes information for exchanging memory mappings for direct memory access, as well as kicking / interrupting the other side if data is put into the virtio queue. The Unix socket, in neutron, is named vhuxxxxxxxx-xx.

  • The actual dataplane is implemented via direct memory access. The virtio-net driver within the guest allocates part of the instance memory for the virtio queue. The structure of this queue is standardized in the virtio standard. Qemu shares this memory section’s address with OVS DPDK over the control channel. DPDK itself then maps the same standardized virtio queue structure onto this memory section and can thus directly read from and write to the virtio queue within the instance’s hugepage memory. This direct memory access is one of the reasons why both OVS DPDK and qemu need to use hugepage memory. If qemu is otherwise set up correctly, but lacks configuration for huge page memory, then OVS DPDK will not be able to access qemu’s memory and hence no packets can be exchanged. Users will notice this if they forget to request instance hugepages via nova’s metadata.

When OVS DPDK transmits towards the instance, these packets will show up within OVS DPDK’s statistics as Tx on port vhuxxxxxxxx-xx. Within the instance, these packets show up as Rx.

When the instance transmits packets to OVS DPDK, then on the instance, these packets show up as Tx, and on OVS DPDK’s vhuxxxxxxxx-xx port, they show up as Rx.

Note that the instance does not have “hardware” counters. ethtool -s is not implemented. All low level counters do only show up within OVS (ovs-vsctl list get interfave vhuxxxxxxxx-xx statistics) and report OVS DPDK’s perspective.

Although packets can be directly transmitted via shared memory, either side needs a means to tell the opposite side that a packet was copied into the virtio queue. This happens by kicking the other side over the control plane which is implemented with the vhost user socket vhuxxxxxxxx-xx. Kicking the other side comes at a cost. Firstly, a system call is needed to write to the socket. Secondly, an interrupt will have to be processed by the other side. Hence both sender and receiver spend costly extra time within the control channel.

In order to avoid costly kicks via the control plane, both Open vSwitch and qemu can set specific flags to signal to the other side that they do not wish to receive an interrupt. However, they can only do so if they either temporarily or constantly poll the virtio queue.

For instance network performance this means that the optimal means of packet processing is DPDK within the instance itself. While Linux kernel networking (NAPI) uses a mix of interrupt and poll mode processing, it is still exposed to a high number of interrupts. OVS DPDK sends packets towards the instance at very high rates. At the same time, the RX and TX buffers of qemu’s virtio queue are limited to a default of 256 and a maximum of 1024 entries. As a consequence, the instance itself needs to process packets very quickly. This is ideally achieved by constantly polling with a DPDK PMD on the instance’s interface.

The vhost user protocol

https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.txt

Vhost-user Protocol
===================

Copyright (c) 2014 Virtual Open Systems Sarl.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.
===================

This protocol is aiming to complement the ioctl interface used to control the
vhost implementation in the Linux kernel. It implements the control plane needed
to establish virtqueue sharing with a user space process on the same host. It
uses communication over a Unix domain socket to share file descriptors in the
ancillary data of the message.

The protocol defines 2 sides of the communication, master and slave. Master is
the application that shares its virtqueues, in our case QEMU. Slave is the
consumer of the virtqueues.

In the current implementation QEMU is the Master, and the Slave is intended to
be a software Ethernet switch running in user space, such as Snabbswitch.

Master and slave can be either a client (i.e. connecting) or server (listening)
in the socket communication.

vhost user has since 2 sides:

  • Master - qemu

  • Slave - Open vSwitch or any other software switch

vhost user can run in 2 modes:

  • vhostuser-client - qemu is the server, the software switch is the client

  • vhostuser - the software switch is the server, qemu is the client

vhost user is based on the vhost architecture and implements all features in user space.

When a qemu instance boots, it will allocate all of the instance memory as shared hugepages. The OS' virtio paravirtualized driver will reserve part of this hugepage memory for holding the virtio ring buffer. This allows OVS DPDK to directly read from and write into the instance's virtio ring. Both OVS DPDK and qemu can directly exchange packets across this reserved memory section.

"The user space application will receive file descriptors for the pre-allocated shared guest RAM. It will directly access the related vrings in the guest's memory space" (http://www.virtualopensystems.com/en/solutions/guides/snabbswitch-qemu/).

For example, look at the following VM, mode vhostuser:

qemu      528828  0.1  0.0 2920084 34188 ?       Sl   Mar28   1:45 /usr/libexec/qemu-kvm -name guest=instance-00000028,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-58-instance-00000028/master-key.aes -machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,dump-guest-core=off -cpu Skylake-Client,ss=on,hypervisor=on,tsc_adjust=on,pdpe1gb=on,mpx=off,xsavec=off,xgetbv1=off -m 2048 -realtime mlock=off -smp 8,sockets=4,cores=1,threads=2 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/58-instance-00000028,share=yes,size=1073741824,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0-3,memdev=ram-node0 -object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/58-instance-00000028,share=yes,size=1073741824,host-nodes=1,policy=bind -numa node,nodeid=1,cpus=4-7,memdev=ram-node1 -uuid 48888226-7b6b-415c-bcf7-b278ba0bca62 -smbios type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.1.0-3.el7ost,serial=3d5e138a-8193-41e4-ac95-de9bfc1a3ef1,uuid=48888226-7b6b-415c-bcf7-b278ba0bca62,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-58-instance-00000028/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/48888226-7b6b-415c-bcf7-b278ba0bca62/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev socket,id=charnet0,path=/var/run/openvswitch/vhuc26fd3c6-4b -netdev vhost-user,chardev=charnet0,queues=8,id=hostnet0 -device virtio-net-pci,mq=on,vectors=18,netdev=hostnet0,id=net0,mac=fa:16:3e:52:30:73,bus=pci.0,addr=0x3 -add-fd set=0,fd=33 -chardev file,id=charserial0,path=/dev/fdset/0,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 172.16.2.10:1 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on

Qemu is instructed to allocate memory from the huge page pool and to make it shared memory (share=yes):

-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/58-instance-00000028,share=yes,size=1073741824,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0-3,memdev=ram-node0 -object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/58-instance-00000028,share=yes,size=1073741824,host-nodes=1,policy=bind

Simply copying packets into the other party's buffer is not enough, however. Additionally, vhost user uses a Unix domain socket (vhu[a-f0-9-]) for communication between the vswitch and qemu, both during initialization and to kick the other side when packets were copied into the virtio ring in shared memory. Interaction hence consists of a control path (vhu socket) for setup and notification and a datapath (direct memory access) for moving the actual payload.

For the described Virtio mechanism to work, we need a setup interface to initialize the shared memory regions and exchange the event file descriptors. A Unix domain socket implements an API which allows us to do that. This straightforward socket interface can be used to initialize the userspace Virtio transport (vhost-user), in particular:

* Vrings are determined at initialization and are placed in shared memory between the two processed.

* For Virtio events (Vring kicks) we shall use eventfds that map to Vring events. This allows us compatibility with the QEMU/KVM implementation described in the next chapter, since KVM allows us to match events coming from virtio_pci in the guest with eventfds (ioeventfd and irqfd).

Sharing file descriptors between two processes differs than sharing them between a process and the kernel. One needs to use sendmsg over a Unix domain socket with SCM_RIGHTS set.

(http://www.virtualopensystems.com/en/solutions/guides/snabbswitch-qemu/)

In vhostuser mode, OVS creates the vhu socket and qemu connects to it. in vhostuser client mode, qemu creates the vhu socket and OVS connects to it.

In the above example instance with vhostuser mode, qemu is instructed to connect a netdev of type vhost-user to /var/run/openvswitch/vhuc26fd3c6-4b:

-chardev socket,id=charnet0,path=/var/run/openvswitch/vhuc26fd3c6-4b -netdev vhost-user,chardev=charnet0,queues=8,id=hostnet0 -device virtio-net-pci,mq=on,vectors=18,netdev=hostnet0,id=net0,mac=fa:16:3e:52:30:73,bus=pci.0,addr=0x3

lsof reveals that the socket is created by OVS:

[root@overcloud-compute-0 ~]# lsof -nn | grep vhuc26fd3c6-4b | awk '{print $1}' | uniq
ovs-vswit
vfio-sync
eal-intr-
lcore-sla
dpdk_watc
vhost_thr
ct_clean3
urcu4
handler12
handler13
handler14
handler15
revalidat
pmd189
pmd182
pmd187
pmd184
pmd185
pmd186
pmd183
pmd188

When a packet is copied into the virtio ring in shared memory by one of the participants, the other side either

  • does currently (e.g. Linux kernel's NAPI) or constantly (e.g. DPDK's PMD) poll the queue in case of which it will pick up new packets without further notice.

  • does not poll the queue and must be notified of the arrival of packets.

For the second case, the instance can be kicked via the separate control path across the vhu socket. The control path implements interrupts in user space by exchanging eventfd objects. Note that writing to the socket requires system calls and will cause the PMDs to spend time in kernel space. The VM can switch off the control path by setting the VRING_AVAIL_F_NO_INTERRUPT flag. Otherwise, Open vSwitch will kick (interrupt) the VM whenever it puts new packets into the virtio ring.

Further details can be found in the following blog post: http://blog.vmsplice.net/2011/09/qemu-internals-vhost-architecture.html

Vhost as a userspace interface

One surprising aspect of the vhost architecture is that it is not tied to KVM in any way. Vhost is a userspace interface and has no dependency on the KVM kernel module. This means other userspace code, like libpcap, could in theory use vhost devices if they find them convenient high-performance I/O interfaces.

When a guest kicks the host because it has placed buffers onto a virtqueue, there needs to be a way to signal the vhost worker thread that there is work to do. Since vhost does not depend on the KVM kernel module they cannot communicate directly. Instead vhost instances are set up with an eventfd file descriptor which the vhost worker thread watches for activity. The KVM kernel module has a feature known as ioeventfd for taking an eventfd and hooking it up to a particular guest I/O exit. QEMU userspace registers an ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY hardware register access which kicks the virtqueue. This is how the vhost worker thread gets notified by the KVM kernel module when the guest kicks the virtqueue.

On the return trip from the vhost worker thread to interrupting the guest a similar approach is used. Vhost takes a "call" file descriptor which it will write to in order to kick the guest. The KVM kernel module has a feature called irqfd which allows an eventfd to trigger guest interrupts. QEMU userspace registers an irqfd for the virtio PCI device interrupt and hands it to the vhost instance. This is how the vhost worker thread can interrupt the guest.

In the end the vhost instance only knows about the guest memory mapping, a kick eventfd, and a call eventfd.
Where to find out more
Here are the main points to begin exploring the code:

    drivers/vhost/vhost.c - common vhost driver code
    drivers/vhost/net.c - vhost-net driver
    virt/kvm/eventfd.c - ioeventfd and irqfd

The QEMU userspace code shows how to initialize the vhost instance:

    hw/vhost.c - common vhost initialization code
    hw/vhost_net.c - vhost-net initialization

The datapath - direct memory access

How memory is mapped for the virtq

The virtio standard defines exactly what a virtq should look like.

2.4 Virtqueues

The mechanism for bulk data transport on virtio devices is pretentiously called a virtqueue. Each device can have zero or more virtqueues. Each queue has a 16-bit queue size parameter, which sets the number of entries and implies the total size of the queue.

Each virtqueue consists of three parts:

    Descriptor Table
    Available Ring
    Used Ring

http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html

The standard exactly defines the structure of the descriptor table, available ring and used ring. For example, for the available ring:

2.4.6 The Virtqueue Available Ring
struct virtq_avail {
#define VIRTQ_AVAIL_F_NO_INTERRUPT      1
        le16 flags;
        le16 idx;
        le16 ring[ /* Queue Size */ ];
        le16 used_event; /* Only if VIRTIO_F_EVENT_IDX */
};

The driver uses the available ring to offer buffers to the device: each ring entry refers to the head of a descriptor chain. It is only written by the driver and read by the device.

idx field indicates where the driver would put the next descriptor entry in the ring (modulo the queue size). This starts at 0, and increases. Note: The legacy [Virtio PCI Draft] referred to this structure as vring_avail, and the constant as VRING_AVAIL_F_NO_INTERRUPT, but the layout and value were identical.

http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html

In order to make direct memory access possible, DPDK implements the above standard.

dpdk-stable-16.11.4/drivers/net/virtio/virtio_ring.h

 48 /* The Host uses this in used->flags to advise the Guest: don't kick me
 49  * when you add a buffer.  It's unreliable, so it's simply an
 50  * optimization.  Guest will still kick if it's out of buffers. */
 51 #define VRING_USED_F_NO_NOTIFY  1
 52 /* The Guest uses this in avail->flags to advise the Host: don't
 53  * interrupt me when you consume a buffer.  It's unreliable, so it's
 54  * simply an optimization.  */
 55 #define VRING_AVAIL_F_NO_INTERRUPT  1
 56
 57 /* VirtIO ring descriptors: 16 bytes.
 58  * These can chain together via "next". */
 59 struct vring_desc {
 60         uint64_t addr;  /*  Address (guest-physical). */
 61         uint32_t len;   /* Length. */
 62         uint16_t flags; /* The flags as indicated above. */
 63         uint16_t next;  /* We chain unused descriptors via this. */
 64 };
 65
 66 struct vring_avail {
 67         uint16_t flags;
 68         uint16_t idx;
 69         uint16_t ring[0];
 70 };
 71
 72 /* id is a 16bit index. uint32_t is used here for ids for padding reasons. */
 73 struct vring_used_elem {
 74         /* Index of start of used descriptor chain. */
 75         uint32_t id;
 76         /* Total length of the descriptor chain which was written to. */
 77         uint32_t len;
 78 };
 79
 80 struct vring_used {
 81         uint16_t flags;
 82         volatile uint16_t idx;
 83         struct vring_used_elem ring[0];
 84 };
 85
 86 struct vring {
 87         unsigned int num;
 88         struct vring_desc  *desc;
 89         struct vring_avail *avail;
 90         struct vring_used  *used;
 91 };

dpdk-stable-16.11.4/lib/librte_vhost/vhost.h

 81 struct vhost_virtqueue {
 82         struct vring_desc       *desc;
 83         struct vring_avail      *avail;
 84         struct vring_used       *used;
 85         uint32_t                size;
 86
 87         uint16_t                last_avail_idx;
 88         uint16_t                last_used_idx;
 89 #define VIRTIO_INVALID_EVENTFD          (-1)
 90 #define VIRTIO_UNINITIALIZED_EVENTFD    (-2)
 91
 92         /* Backend value to determine if device should started/stopped */
 93         int                     backend;
 94         /* Used to notify the guest (trigger interrupt) */
 95         int                     callfd;
 96         /* Currently unused as polling mode is enabled */
 97         int                     kickfd;
 98         int                     enabled;
 99
100         /* Physical address of used ring, for logging */
101         uint64_t                log_guest_addr;
102
103         uint16_t                nr_zmbuf;
104         uint16_t                zmbuf_size;
105         uint16_t                last_zmbuf_idx;
106         struct zcopy_mbuf       *zmbufs;
107         struct zcopy_mbuf_list  zmbuf_list;
108
109         struct vring_used_elem  *shadow_used_ring;
110         uint16_t                shadow_used_idx;
111 } __rte_cache_aligned;

Once the memory mapping is done, DPDK can directly act on and manipulate the same structures as virtio-net within the guest's shared memory.

The control path - Unix sockets

qemu and DPDK message exchange over vhost user socket

DPDK and qemu communicate via the standardized vhost-user protocol.

The message types are:
dpdk-stable-16.11.4/lib/librte_vhost/vhost_user.h

 54 typedef enum VhostUserRequest {
 55         VHOST_USER_NONE = 0,
 56         VHOST_USER_GET_FEATURES = 1,
 57         VHOST_USER_SET_FEATURES = 2,
 58         VHOST_USER_SET_OWNER = 3,
 59         VHOST_USER_RESET_OWNER = 4,
 60         VHOST_USER_SET_MEM_TABLE = 5,
 61         VHOST_USER_SET_LOG_BASE = 6,
 62         VHOST_USER_SET_LOG_FD = 7,
 63         VHOST_USER_SET_VRING_NUM = 8,
 64         VHOST_USER_SET_VRING_ADDR = 9,
 65         VHOST_USER_SET_VRING_BASE = 10,
 66         VHOST_USER_GET_VRING_BASE = 11,
 67         VHOST_USER_SET_VRING_KICK = 12,
 68         VHOST_USER_SET_VRING_CALL = 13,
 69         VHOST_USER_SET_VRING_ERR = 14,
 70         VHOST_USER_GET_PROTOCOL_FEATURES = 15,
 71         VHOST_USER_SET_PROTOCOL_FEATURES = 16,
 72         VHOST_USER_GET_QUEUE_NUM = 17,
 73         VHOST_USER_SET_VRING_ENABLE = 18,
 74         VHOST_USER_SEND_RARP = 19,
 75         VHOST_USER_MAX
 76 } VhostUserRequest;

Further details about the message types can be found in qemu's source code in: https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.txt

DPDK processes incoming messages with …
dpdk-stable-16.11.4/lib/librte_vhost/vhost_user.c

 920 int
 921 vhost_user_msg_handler(int vid, int fd)
 922 {

… which uses:
dpdk-stable-16.11.4/lib/librte_vhost/vhost_user.c:

 872 /* return bytes# of read on success or negative val on failure. */
 873 static int
 874 read_vhost_message(int sockfd, struct VhostUserMsg *msg)
 875 {

DPDK writes outgoing messages with:
dpdk-stable-16.11.4/lib/librte_vhost/vhost_user.c

 902 static int
 903 send_vhost_message(int sockfd, struct VhostUserMsg *msg)
 904 {

qemu has an equivalent method for receiving:
qemu-2.9.0/contrib/libvhost-user/libvhost-user.c

746 static bool
747 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)

And qemu obviously also has an equivalent method for sending:
qemu-2.9.0/hw/virtio/vhost-user.c

198 /* most non-init callers ignore the error */
199 static int vhost_user_write(struct vhost_dev *dev, VhostUserMsg *msg,
200                             int *fds, int fd_num)
201 {

How DPDK registers the Unix socket and uses it for message exchange

neutron instructs Open vSwitch to create a port with name vhuxxxxxxxx-xx. Within OVS, this name is saved in the netdev structure as netdev->name.

When it creates the vhost user port, Open vSwitch instructs DPDK to register a new vhost-user socket. The socket's path is set as dev->vhost_id which is a concatenation of vhost_sock_dir and netdev->name.

OVS can request to create the socket in vhost user client mode by passing the RTE_VHOST_USER_CLIENT flag.

OVS' netdev_dpdk_vhost_construct method calls DPDK's rte_vhost_driver_register method, which in turn executes vhost_user_create_server or vhost_user_create_client. By default, vhost user server mode is used, or if RTE_VHOST_USER_CLIENT is set, vhost user client mode.

Overview of the involved methods:

OVS
                       netdev_dpdk_vhost_construct
                        (struct netdev *netdev)
                                   |
                                   |
DPDK                               V
                      rte_vhost_driver_register
                        (const char *path, uint64_t flags)
                                   |
             -----------------------------------------------                     
             |                                             |
             V                                             |
vhost_user_create_server                                   |
  (struct vhost_user_socket *vsocket)                      |
             |                                             |
             V                                             V
vhost_user_server_new_connection                     vhost_user_create_client                     vhost_user_client_reconnect
(int fd, void *dat, int *remove __rte_unused)          (struct vhost_user_socket *vsocket)          (void *arg __rte_unused)
             |                                             |                                                  |
             V                                             V                                                  V
             --------------------------------------------------------------------------------------------------
                                                           |
                                                           V
                                               vhost_user_add_connection
                                                 (int fd, struct vhost_user_socket *vsocket)
                                                           |
                                                           V
                                               vhost_user_read_cb
                                                (int connfd, void *dat, int *remove)
                                                           |
                                                           V
                                                vhost_user_msg_handler

netdev_dpdk_vhost_construct is in openvswitch-2.6.1/lib/netdev-dpdk.c:

 886 static int
 887 netdev_dpdk_vhost_construct(struct netdev *netdev)
 888 {
 889     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
 890     const char *name = netdev->name;
 891     int err;
 892
 893     /* 'name' is appended to 'vhost_sock_dir' and used to create a socket in
 894      * the file system. '/' or '\' would traverse directories, so they're not
 895      * acceptable in 'name'. */
 896     if (strchr(name, '/') || strchr(name, '\\')) {
 897         VLOG_ERR("\"%s\" is not a valid name for a vhost-user port. "
 898                  "A valid name must not include '/' or '\\'",
 899                  name);
 900         return EINVAL;
 901     }
 902
 903     if (rte_eal_init_ret) {
 904         return rte_eal_init_ret;
 905     }
 906
 907     ovs_mutex_lock(&dpdk_mutex);
 908     /* Take the name of the vhost-user port and append it to the location where
 909      * the socket is to be created, then register the socket.
 910      */
 911     snprintf(dev->vhost_id, sizeof dev->vhost_id, "%s/%s",
 912              vhost_sock_dir, name);
 913
 914     dev->vhost_driver_flags &= ~RTE_VHOST_USER_CLIENT;
 915     err = rte_vhost_driver_register(dev->vhost_id, dev->vhost_driver_flags);
 916     if (err) {
 917         VLOG_ERR("vhost-user socket device setup failure for socket %s\n",
 918                  dev->vhost_id);
 919     } else {
 920         fatal_signal_add_file_to_unlink(dev->vhost_id);
 921         VLOG_INFO("Socket %s created for vhost-user port %s\n",
 922                   dev->vhost_id, name);
 923     }
 924     err = netdev_dpdk_init(netdev, -1, DPDK_DEV_VHOST);
 925
 926     ovs_mutex_unlock(&dpdk_mutex);
 927     return err;
 928 }

netdev_dpdk_vhost_construct calls rte_vhost_driver_register. All of the following code is in dpdk-stable-16.11.4/lib/librte_vhost/socket.c:

494 /*
495  * Register a new vhost-user socket; here we could act as server
496  * (the default case), or client (when RTE_VHOST_USER_CLIENT) flag
497  * is set.
498  */
499 int
500 rte_vhost_driver_register(const char *path, uint64_t flags)
501 {
(...)
525         if ((flags & RTE_VHOST_USER_CLIENT) != 0) {
526                 vsocket->reconnect = !(flags & RTE_VHOST_USER_NO_RECONNECT);
527                 if (vsocket->reconnect && reconn_tid == 0) {
528                         if (vhost_user_reconnect_init() < 0) {
529                                 free(vsocket->path);
530                                 free(vsocket);
531                                 goto out;
532                         }
533                 }
534                 ret = vhost_user_create_client(vsocket);
535         } else {
536                 vsocket->is_server = true;
537                 ret = vhost_user_create_server(vsocket);
538         }

vhost_user_create_server calls vhost_user_server_new_connection:

304 static int
305 vhost_user_create_server(struct vhost_user_socket *vsocket)
306 {
307         int fd;
308         int ret;
309         struct sockaddr_un un;
310         const char *path = vsocket->path;
311
312         fd = create_unix_socket(path, &un, vsocket->is_server);

And any of the 3 following methods calls vhost_user_add_connection:

239 /* call back when there is new vhost-user connection from client  */
240 static void
241 vhost_user_server_new_connection(int fd, void *dat, int *remove __rte_unused)
242 {
(...)
386 static void *
387 vhost_user_client_reconnect(void *arg __rte_unused)
388 {
(...)
447 static int
448 vhost_user_create_client(struct vhost_user_socket *vsocket)
449 {
(...)
190 static void
191 vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
192 {

vhost_user_add_connection then executes vhost_user_read_cb which in turn runs vhost_user_msg_handler for incoming message handling.

253 static void
254 vhost_user_read_cb(int connfd, void *dat, int *remove)
255 {
256         struct vhost_user_connection *conn = dat;
257         struct vhost_user_socket *vsocket = conn->vsocket;
258         int ret;
259
260         ret = vhost_user_msg_handler(conn->vid, connfd);
261         if (ret < 0) {
262                 close(connfd);
263                 *remove = 1;
264                 vhost_destroy_device(conn->vid);
265
266                 pthread_mutex_lock(&vsocket->conn_mutex);
267                 TAILQ_REMOVE(&vsocket->conn_list, conn, next);
268                 pthread_mutex_unlock(&vsocket->conn_mutex);
269
270                 free(conn);
271
272                 if (vsocket->reconnect)
273                         vhost_user_create_client(vsocket);
274         }
275 }

dpdk-stable-16.11.4/lib/librte_vhost/vhost_user.c

  920 int
 921 vhost_user_msg_handler(int vid, int fd)
 922 {
 923         struct virtio_net *dev;
 924         struct VhostUserMsg msg;
 925         int ret;
 926 
 927         dev = get_device(vid);
 928         if (dev == NULL)
 929                 return -1;
 930 
 931         ret = read_vhost_message(fd, &msg);
 932         if (ret <= 0 || msg.request >= VHOST_USER_MAX) {
 933                 if (ret < 0)
 934                         RTE_LOG(ERR, VHOST_CONFIG,
 935                                 "vhost read message failed\n");
 936                 else if (ret == 0)
 937                         RTE_LOG(INFO, VHOST_CONFIG,
 938                                 "vhost peer closed\n");
 939                 else
 940                         RTE_LOG(ERR, VHOST_CONFIG,
 941                                 "vhost read incorrect message\n");
 942 
 943                 return -1;
 944         }
 945 
 946         RTE_LOG(INFO, VHOST_CONFIG, "read message %s\n",
 947                 vhost_message_str[msg.request]);
 948         switch (msg.request) {
 949         case VHOST_USER_GET_FEATURES:
 950                 msg.payload.u64 = vhost_user_get_features();
 951                 msg.size = sizeof(msg.payload.u64);
 952                 send_vhost_message(fd, &msg);
 953                 break;
 954         case VHOST_USER_SET_FEATURES:
 955                 vhost_user_set_features(dev, msg.payload.u64);
 956                 break;
 957 
 958         case VHOST_USER_GET_PROTOCOL_FEATURES:
 959                 msg.payload.u64 = VHOST_USER_PROTOCOL_FEATURES;
 960                 msg.size = sizeof(msg.payload.u64);
 961                 send_vhost_message(fd, &msg);
 962                 break;
 963         case VHOST_USER_SET_PROTOCOL_FEATURES:
 964                 vhost_user_set_protocol_features(dev, msg.payload.u64);
 965                 break;
 966 
 967         case VHOST_USER_SET_OWNER:
 968                 vhost_user_set_owner();
 969                 break;
 970         case VHOST_USER_RESET_OWNER:
 971                 vhost_user_reset_owner(dev);
 972                 break;
 973 
 974         case VHOST_USER_SET_MEM_TABLE:
 975                 vhost_user_set_mem_table(dev, &msg);
 976                 break;
 977 
 978         case VHOST_USER_SET_LOG_BASE:
 979                 vhost_user_set_log_base(dev, &msg);
 980 
 981                 /* it needs a reply */
 982                 msg.size = sizeof(msg.payload.u64);
 983                 send_vhost_message(fd, &msg);
 984                 break;
 985         case VHOST_USER_SET_LOG_FD:
 986                 close(msg.fds[0]);
 987                 RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
 988                 break;
 989 
 990         case VHOST_USER_SET_VRING_NUM:
 991                 vhost_user_set_vring_num(dev, &msg.payload.state);
 992                 break;
 993         case VHOST_USER_SET_VRING_ADDR:
 994                 vhost_user_set_vring_addr(dev, &msg.payload.addr);
 995                 break;
 996         case VHOST_USER_SET_VRING_BASE:
 997                 vhost_user_set_vring_base(dev, &msg.payload.state);
 998                 break;
 999 
1000         case VHOST_USER_GET_VRING_BASE:
1001                 ret = vhost_user_get_vring_base(dev, &msg.payload.state);
1002                 msg.size = sizeof(msg.payload.state);
1003                 send_vhost_message(fd, &msg);
1004                 break;
1005 
1006         case VHOST_USER_SET_VRING_KICK:
1007                 vhost_user_set_vring_kick(dev, &msg);
1008                 break;
1009         case VHOST_USER_SET_VRING_CALL:
1010                 vhost_user_set_vring_call(dev, &msg);
1011                 break;
1012 
1013         case VHOST_USER_SET_VRING_ERR:
1014                 if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK))
1015                         close(msg.fds[0]);
1016                 RTE_LOG(INFO, VHOST_CONFIG, "not implemented\n");
1017                 break;
1018 
1019         case VHOST_USER_GET_QUEUE_NUM:
1020                 msg.payload.u64 = VHOST_MAX_QUEUE_PAIRS;
1021                 msg.size = sizeof(msg.payload.u64);
1022                 send_vhost_message(fd, &msg);
1023                 break;
1024 
1025         case VHOST_USER_SET_VRING_ENABLE:
1026                 vhost_user_set_vring_enable(dev, &msg.payload.state);
1027                 break;
1028         case VHOST_USER_SEND_RARP:
1029                 vhost_user_send_rarp(dev, &msg);
1030                 break;
1031 
1032         default:
1033                 break;
1034 
1035         }
1036 
1037         return 0;
1038 }

How virtio communicates the virtio queue’s memory addresses to DPDK

DPDK uses a method called vhost_user_set_vring_addr to convert virtio's desc, used and avail ring addresses to its own address space.

dpdk-stable-16.11.4/lib/librte_vhost/vhost_user.c

 324 /*
 325  * The virtio device sends us the desc, used and avail ring addresses.
 326  * This function then converts these to our address space.
 327  */
 328 static int
 329 vhost_user_set_vring_addr(struct virtio_net *dev, struct vhost_vring_addr *addr)
 330 {
 331         struct vhost_virtqueue *vq;
 332
 333         if (dev->mem == NULL)
 334                 return -1;
 335
 336         /* addr->index refers to the queue index. The txq 1, rxq is 0. */
 337         vq = dev->virtqueue[addr->index];
 338
 339         /* The addresses are converted from QEMU virtual to Vhost virtual. */
 340         vq->desc = (struct vring_desc *)(uintptr_t)qva_to_vva(dev,
 341                         addr->desc_user_addr);
 342         if (vq->desc == 0) {
 343                 RTE_LOG(ERR, VHOST_CONFIG,
 344                         "(%d) failed to find desc ring address.\n",
 345                         dev->vid);
 346                 return -1;
 347         }
 348
 349         dev = numa_realloc(dev, addr->index);
 350         vq = dev->virtqueue[addr->index];
 351
 352         vq->avail = (struct vring_avail *)(uintptr_t)qva_to_vva(dev,
 353                         addr->avail_user_addr);
 354         if (vq->avail == 0) {
 355                 RTE_LOG(ERR, VHOST_CONFIG,
 356                         "(%d) failed to find avail ring address.\n",
 357                         dev->vid);
 358                 return -1;
 359         }
 360
 361         vq->used = (struct vring_used *)(uintptr_t)qva_to_vva(dev,
 362                         addr->used_user_addr);
 363         if (vq->used == 0) {
 364                 RTE_LOG(ERR, VHOST_CONFIG,
 365                         "(%d) failed to find used ring address.\n",
 366                         dev->vid);
 367                 return -1;
 368         }
 369
 370         if (vq->last_used_idx != vq->used->idx) {
 371                 RTE_LOG(WARNING, VHOST_CONFIG,
 372                         "last_used_idx (%u) and vq->used->idx (%u) mismatches; "
 373                         "some packets maybe resent for Tx and dropped for Rx\n",
 374                         vq->last_used_idx, vq->used->idx);
 375                 vq->last_used_idx  = vq->used->idx;
 376                 vq->last_avail_idx = vq->used->idx;
 377         }
 378
 379         vq->log_guest_addr = addr->log_guest_addr;
 380
 381         LOG_DEBUG(VHOST_CONFIG, "(%d) mapped address desc: %p\n",
 382                         dev->vid, vq->desc);
 383         LOG_DEBUG(VHOST_CONFIG, "(%d) mapped address avail: %p\n",
 384                         dev->vid, vq->avail);
 385         LOG_DEBUG(VHOST_CONFIG, "(%d) mapped address used: %p\n",
 386                         dev->vid, vq->used);
 387         LOG_DEBUG(VHOST_CONFIG, "(%d) log_guest_addr: %" PRIx64 "\n",
 388                         dev->vid, vq->log_guest_addr);
 389
 390         return 0;
 391 }

The memory will be set if a message of type VHOST_USER_SET_VRING_ADDR arrives via the vhu socket:
dpdk-stable-16.11.4/lib/librte_vhost/vhost_user.c

 920 int
 921 vhost_user_msg_handler(int vid, int fd)
 922 {
 923         struct virtio_net *dev;
 924         struct VhostUserMsg msg;
 925         int ret;
 926
 927         dev = get_device(vid);
 928         if (dev == NULL)
 929                 return -1;
 930
 931         ret = read_vhost_message(fd, &msg);
 932         if (ret <= 0 || msg.request >= VHOST_USER_MAX) {
 933                 if (ret < 0)
 934                         RTE_LOG(ERR, VHOST_CONFIG,
 935                                 "vhost read message failed\n");
 936                 else if (ret == 0)
 937                         RTE_LOG(INFO, VHOST_CONFIG,
 938                                 "vhost peer closed\n");
 939                 else
 940                         RTE_LOG(ERR, VHOST_CONFIG,
 941                                 "vhost read incorrect message\n");
 942
 943                 return -1;
 944         }
 945
 946         RTE_LOG(INFO, VHOST_CONFIG, "read message %s\n",
 947                 vhost_message_str[msg.request]);
 948         switch (msg.request) {
 949         case VHOST_USER_GET_FEATURES:
 950                 msg.payload.u64 = vhost_user_get_features();
 951                 msg.size = sizeof(msg.payload.u64);
 952                 send_vhost_message(fd, &msg);
 953                 break;
(...)
 993         case VHOST_USER_SET_VRING_ADDR:
 994                 vhost_user_set_vring_addr(dev, &msg.payload.addr);
 995                 break;

As a matter of fact, qemu has an equivalent to that same method in DPDK:
qemu-2.9.0/contrib/libvhost-user/libvhost-user.c

746 static bool
 747 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 748 {
 749     int do_reply = 0;
 750
 751     /* Print out generic part of the request. */
 752     DPRINT("================ Vhost user message ================\n");
 753     DPRINT("Request: %s (%d)\n", vu_request_to_string(vmsg->request),
 754            vmsg->request);
 755     DPRINT("Flags:   0x%x\n", vmsg->flags);
 756     DPRINT("Size:    %d\n", vmsg->size);
 757
 758     if (vmsg->fd_num) {
 759         int i;
 760         DPRINT("Fds:");
 761         for (i = 0; i < vmsg->fd_num; i++) {
 762             DPRINT(" %d", vmsg->fds[i]);
 763         }
 764         DPRINT("\n");
 765     }
 766
 767     if (dev->iface->process_msg &&
 768         dev->iface->process_msg(dev, vmsg, &do_reply)) {
 769         return do_reply;
 770     }
 771
 772     switch (vmsg->request) {
 773     case VHOST_USER_GET_FEATURES:
 774         return vu_get_features_exec(dev, vmsg);
(...)
 793     case VHOST_USER_SET_VRING_ADDR:
 794         return vu_set_vring_addr_exec(dev, vmsg);
(...)

Obviously, it also has a method to communicate that address via the socket to DPDK
qemu-2.9.0/hw/virtio/vhost-user.c:

329 static int vhost_user_set_vring_addr(struct vhost_dev *dev,
330                                      struct vhost_vring_addr *addr)
331 {
332     VhostUserMsg msg = {
333         .request = VHOST_USER_SET_VRING_ADDR,
334         .flags = VHOST_USER_VERSION,
335         .payload.addr = *addr,
336         .size = sizeof(msg.payload.addr),
337     };
338
339     if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
340         return -1;
341     }
342
343     return 0;
344 }

How does OVS DPDK with vhost user transmit packets into an instance and when to Tx drops occur?

The code that handles Tx from OVS DPDK to the instance is in __netdev_dpdk_vhost_send() function in the lib/netdev-dpdk.c.

OVS tries to send and if there is no more space but there was a progress, it retries for VHOST_ENQ_RETRY_NUM (by default 8) times. If there was no progress on the first try (no packets pushed to the ring), or it exceeds VHOST_ENQ_RETRY_NUM times, then it drops all remaining packets in the batch (up to cnt which is a maximum of 32 packets).

1520     do {
1523
1524         tx_pkts = rte_vhost_enqueue_burst(netdev_dpdk_get_vid(dev),
1525                                           vhost_qid, cur_pkts, cnt);
1526         if (OVS_LIKELY(tx_pkts)) {
1527             /* Packets have been sent.*/
1528             cnt -= tx_pkts;
1529             /* Prepare for possible retry.*/
1530             cur_pkts = &cur_pkts[tx_pkts];
1531         } else {
1532             /* No packets sent - do not retry.*/
1533             break;
1534         }
1535     } while (cnt && (retries++ <= VHOST_ENQ_RETRY_NUM));
1536
1545     for (i = 0; i < total_pkts - dropped; i++) {
1546         dp_packet_delete(pkts[i]);
1547     }

Instance Rx interrupt handling

When OVS DPDK puts new data into virtio's ring, two possible scenarios exist:

  • the instance is not polling its queues and hence needs to be made aware that new packets arrived

  • the instance is currently polling and hence there is no need to tell it that new data is in the ring

Within an instance that uses Linux kernel networking, the Linux Networking stack uses NAPI which is a mix of interrupt and polling modes. The guest OS starts in interrupt mode, so it does nothing until the first interrupt comes in. When that happens, the CPU quickly ACKs the IRQ and schedules the ksoftirqd thread to run the callback. It will also disable any further IRQs.

When ksoftirqd runs, it will try to poll as many packets as possible up to the netdev_budget. If there are more packets in the queue, then ksoftirqd will reschedule itself, and repeat the operation until no more packets are available. Note that it is polling and there is no need for further interrupts. When no more packets are available, ksoftirqd will stop polling and re-enable the IRQ to wait for the next packet/interrupt.

When the instance is polling, CPU caches are hot, low extra latency or delays, and the right processes in the host and in the VM are running, further reducing latency. Additionally, in order for the host to send an IRQ to the guest, it needs to write to the Unix socket (syscall), which is expensive, and adds more to the extra delays/cost.

The advantage of running DPDK within the instance as part of the NFV application lies in how PMDs handle traffic: because PMDs are constantly polling, they can switch off the interrupt constantly and hence OVS DPDK does not have to kick the VM any more. This saves OVS DPDK from having to write to the socket and hence spares it from having to execute a write system call to the kernel. This is a win-win situation: OVS DPDK's PMD remains in user space and instances can process packets faster because they skip the extra overhead of being kicked through the control plan and interrupt handling.

The guest can receive an interrupt if flag VRING_AVAIL_F_NO_INTERRUPT is not set. This interrupt is transmitted to the guest via callfd and the OS' eventfd object.

The guest OS can hence enable or disable interrupts. When the guest disables interrupts on the virtio interface then virtio-net will translate this and will use flag VRING_AVAIL_F_NO_INTERRUPT which is defined in both DPDK and qemu:

[root@overcloud-compute-0 SOURCES]# grep VRING_AVAIL_F_NO_INTERRUPT -R | grep def
dpdk-stable-16.11.5/drivers/net/virtio/virtio_ring.h:#define VRING_AVAIL_F_NO_INTERRUPT  1
dpdk-stable-17.11/drivers/net/virtio/virtio_ring.h:#define VRING_AVAIL_F_NO_INTERRUPT  1
[root@overcloud-compute-0 SOURCES]#
[root@overcloud-compute-0 qemu]# grep AVAIL_F_NO_INTERRUPT  -R -i | grep def
qemu-2.9.0/roms/SLOF/lib/libvirtio/virtio.h:#define VRING_AVAIL_F_NO_INTERRUPT    1
qemu-2.9.0/roms/ipxe/src/include/ipxe/virtio-ring.h:#define VRING_AVAIL_F_NO_INTERRUPT 1
qemu-2.9.0/roms/seabios/src/hw/virtio-ring.h:#define VRING_AVAIL_F_NO_INTERRUPT 1
qemu-2.9.0/include/standard-headers/linux/virtio_ring.h:#define VRING_AVAIL_F_NO_INTERRUPT    1

Once the bit for VRING_AVAIL_F_NO_INTERRUPT in vq->avail->flags is set, this will instruct DPDK not to kick the instance. From dpdk-stable-16.11.4/lib/librte_vhost/virtio_net.c:

(...)
        /* Kick the guest if necessary. */
        if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
                        && (vq->callfd >= 0))
                eventfd_write(vq->callfd, (eventfd_t)1);
        return count;
(...)

Note that, as explained earlier, this also saves the PMD from having to execute a system call.

A more detailed look at the code - OVS DPDK Tx towards the instance

Further investigation can be done in the code in lib/netdev-dpdk.c:

1487 static void
1488 __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
1489                          struct dp_packet **pkts, int cnt)
1490 {
1491     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
1492     struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
1493     unsigned int total_pkts = cnt;
1494     unsigned int dropped = 0;
1495     int i, retries = 0;
1496
1497     qid = dev->tx_q[qid % netdev->n_txq].map;
1498
1499     if (OVS_UNLIKELY(!is_vhost_running(dev) || qid < 0
1500                      || !(dev->flags & NETDEV_UP))) {
1501         rte_spinlock_lock(&dev->stats_lock);
1502         dev->stats.tx_dropped+= cnt;
1503         rte_spinlock_unlock(&dev->stats_lock);
1504         goto out;
1505     }
1506
1507     rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
1508
1509     cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
1510     /* Check has QoS has been configured for the netdev */
1511     cnt = netdev_dpdk_qos_run__(dev, cur_pkts, cnt);
1512     dropped = total_pkts - cnt;
1513
1514     do {
1515         int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
1516         unsigned int tx_pkts;
1517
1518         tx_pkts = rte_vhost_enqueue_burst(netdev_dpdk_get_vid(dev),
1519                                           vhost_qid, cur_pkts, cnt);
1520         if (OVS_LIKELY(tx_pkts)) {
1521             /* Packets have been sent.*/
1522             cnt -= tx_pkts;
1523             /* Prepare for possible retry.*/
1524             cur_pkts = &cur_pkts[tx_pkts];
1525         } else {
1526             /* No packets sent - do not retry.*/
1527             break;
1528         }
1529     } while (cnt && (retries++ <= VHOST_ENQ_RETRY_NUM));
1530
1531     rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
1532
1533     rte_spinlock_lock(&dev->stats_lock);
1534     netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts,
1535                                          cnt + dropped);
1536     rte_spinlock_unlock(&dev->stats_lock);
1537
1538 out:
1539     for (i = 0; i < total_pkts - dropped; i++) {
1540         dp_packet_delete(pkts[i]);
1541     }
1542 }

The work is executed here:

    do {
        int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
        unsigned int tx_pkts;

        tx_pkts = rte_vhost_enqueue_burst(netdev_dpdk_get_vid(dev),
                                          vhost_qid, cur_pkts, cnt);
        if (OVS_LIKELY(tx_pkts)) {
            /* Packets have been sent.*/
            cnt -= tx_pkts;
            /* Prepare for possible retry.*/
            cur_pkts = &cur_pkts[tx_pkts];
        } else {
            /* No packets sent - do not retry.*/
            break;
        }
    } while (cnt && (retries++ <= VHOST_ENQ_RETRY_NUM));

Method rte_vhost_enqueue_burst comes directly from DPDK's vhost library:

[root@overcloud-compute-0 src]# grep rte_vhost_enqueue_burst dpdk-stable-16.11.4/ -R
dpdk-stable-16.11.4/doc/guides/prog_guide/vhost_lib.rst:* ``rte_vhost_enqueue_burst(vid, queue_id, pkts, count)``
dpdk-stable-16.11.4/doc/guides/rel_notes/release_16_07.rst:* The function ``rte_vhost_enqueue_burst`` no longer supports concurrent enqueuing
dpdk-stable-16.11.4/drivers/net/vhost/rte_eth_vhost.c:    nb_tx = rte_vhost_enqueue_burst(r->vid,
dpdk-stable-16.11.4/examples/tep_termination/vxlan_setup.c:    ret = rte_vhost_enqueue_burst(vid, VIRTIO_RXQ, pkts_valid, count);
dpdk-stable-16.11.4/examples/vhost/main.c:    ret = rte_vhost_enqueue_burst(dst_vdev->vid, VIRTIO_RXQ, &m, 1);
dpdk-stable-16.11.4/examples/vhost/main.c:    enqueue_count = rte_vhost_enqueue_burst(vdev->vid, VIRTIO_RXQ,
dpdk-stable-16.11.4/lib/librte_vhost/rte_vhost_version.map:    rte_vhost_enqueue_burst;
dpdk-stable-16.11.4/lib/librte_vhost/rte_virtio_net.h:uint16_t rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
dpdk-stable-16.11.4/lib/librte_vhost/virtio_net.c:rte_vhost_enqueue_burst(int vid, uint16_t queue_id,

dpdk-stable-16.11.4/lib/librte_vhost/rte_virtio_net.h

/**
 * This function adds buffers to the virtio devices RX virtqueue. Buffers can
 * be received from the physical port or from another virtual device. A packet
 * count is returned to indicate the number of packets that were succesfully
 * added to the RX queue.
 * @param vid
 *  virtio-net device ID
 * @param queue_id
 *  virtio queue index in mq case
 * @param pkts
 *  array to contain packets to be enqueued
 * @param count
 *  packets num to be enqueued
 * @return
 *  num of packets enqueued
 */
uint16_t rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
        struct rte_mbuf **pkts, uint16_t count);

dpdk-stable-16.11.4/lib/librte_vhost/virtio_net.c

uint16_t
rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
        struct rte_mbuf **pkts, uint16_t count)
{
        struct virtio_net *dev = get_device(vid);

        if (!dev)
                return 0;

        if (dev->features & (1 << VIRTIO_NET_F_MRG_RXBUF))
                return virtio_dev_merge_rx(dev, queue_id, pkts, count);
        else
                return virtio_dev_rx(dev, queue_id, pkts, count);
}

Both of these methods will transmit packets into the instance and will notify the instance via an interrupt (write) if necessary:

dpdk-stable-16.11.4/lib/librte_vhost/virtio_net.c

/**
 * This function adds buffers to the virtio devices RX virtqueue. Buffers can
 * be received from the physical port or from another virtio device. A packet
 * count is returned to indicate the number of packets that are succesfully
 * added to the RX queue. This function works when the mbuf is scattered, but
 * it doesn't support the mergeable feature.
 */
static inline uint32_t __attribute__((always_inline))
virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
              struct rte_mbuf **pkts, uint32_t count)
(...)
        /* Kick the guest if necessary. */
        if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
                        && (vq->callfd >= 0))
                eventfd_write(vq->callfd, (eventfd_t)1);
        return count;
(...)
static inline uint32_t __attribute__((always_inline))
virtio_dev_merge_rx(struct virtio_net *dev, uint16_t queue_id,
        struct rte_mbuf **pkts, uint32_t count)
(...)

From the above method virtio_dev_rx:

 319         avail_idx = *((volatile uint16_t *)&vq->avail->idx);
 320         start_idx = vq->last_used_idx;
 321         free_entries = avail_idx - start_idx;
 322         count = RTE_MIN(count, free_entries);
 323         count = RTE_MIN(count, (uint32_t)MAX_PKT_BURST);
 324         if (count == 0)
 325                 return 0;

The number of packets to be transmitted is hence set to MIN( [number of free entries] , [MAX_PKT_BURST] ,[ number of packets to be transmitted passed from calling method] ).

Another unlikely case will set the count to a lower value and will break the for loop:

 344         for (i = 0; i < count; i++) {
 345                 uint16_t desc_idx = desc_indexes[i];
 346                 int err;
 347
 348                 if (vq->desc[desc_idx].flags & VRING_DESC_F_INDIRECT) {
 349                         descs = (struct vring_desc *)(uintptr_t)gpa_to_vva(dev,
 350                                         vq->desc[desc_idx].addr);
 351                         if (unlikely(!descs)) {
 352                                 count = i;
 353                                 break;
 354                         }

In the end, the used index is increased by count:

 378         *(volatile uint16_t *)&vq->used->idx += count;
 379         vq->last_used_idx += count;
 380         vhost_log_used_vring(dev, vq,
 381                 offsetof(struct vring_used, idx),
 382                 sizeof(vq->used->idx));

Data is copied into the instance's memory via:

 48 #include "vhost.h"
(...)
363      err = copy_mbuf_to_desc(dev, descs, pkts[i], desc_idx, sz);

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Comments