A detailed view of the vhost user protocol and its implementation in OVS DPDK, qemu and virtio-net
Environment
Red Hat OpenStack Platform 10
Open vSwitch 2.6.1
Issue
A detailed view of the vhost user protocol and its implementation in OVS DPDK, qemu and virtio-net
Resolution
Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.
A detailed view of the vhost user protocol and its implementation in OVS DPDK, qemu and virtio-net
Overview: How OVS DPDK and qemu communicate via the vhost user protocol
The vhost user protocol consists of a control path and a data path.
- All control information is exchanged via a Unix socket. This includes information for exchanging memory mappings for direct memory access, as well as kicking / interrupting the other side if data is put into the virtio queue. The Unix socket, in neutron, is named vhuxxxxxxxx-xx.
- The actual dataplane is implemented via direct memory access. The virtio-net driver within the guest allocates part of the instance memory for the virtio queue. The structure of this queue is standardized in the virtio standard. Qemu shares this memory section's address with OVS DPDK over the control channel. DPDK itself then maps the same standardized virtio queue structure onto this memory section and can thus directly read from and write to the virtio queue within the instance's hugepage memory. This direct memory access is one of the reasons why both OVS DPDK and qemu need to use hugepage memory. If qemu is otherwise set up correctly, but lacks configuration for huge page memory, then OVS DPDK will not be able to access qemu's memory and hence no packets can be exchanged. Users will notice this if they forget to request instance hugepages via nova's metadata.
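The shared-memory mechanism itself can be illustrated outside of OVS and qemu. The following minimal sketch (not OVS or qemu code; the file name is made up for illustration) maps a hugetlbfs-backed file with MAP_SHARED, which is essentially what qemu's memory-backend-file object with share=yes does, so that another process holding the same file can map the same physical pages:
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE (2UL * 1024 * 1024)   /* one 2 MiB huge page */

int main(void)
{
    /* Hypothetical path; qemu uses mem-path=/dev/hugepages/libvirt/qemu/... */
    const char *path = "/dev/hugepages/vhost-demo";

    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ftruncate(fd, MAP_SIZE) < 0) {
        perror("ftruncate");
        return 1;
    }

    /* MAP_SHARED is the equivalent of qemu's share=yes: any other process
     * that maps the same file (or receives this fd) sees the same pages. */
    void *mem = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    strcpy(mem, "data visible to every process mapping this file");
    munmap(mem, MAP_SIZE);
    close(fd);
    return 0;
}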
When OVS DPDK transmits towards the instance, these packets show up within OVS DPDK's statistics as Tx on port vhuxxxxxxxx-xx. Within the instance, these packets show up as Rx.
When the instance transmits packets to OVS DPDK, then on the instance these packets show up as Tx, and on OVS DPDK's vhuxxxxxxxx-xx port they show up as Rx.
Note that the instance does not have "hardware" counters; ethtool -S is not implemented. All low-level counters only show up within OVS (ovs-vsctl get interface vhuxxxxxxxx-xx statistics) and report OVS DPDK's perspective.
Although packets can be directly transmitted via shared memory, either side needs a means to tell the opposite side that a packet was copied into the virtio queue. This happens by kicking the other side over the control plane, which is implemented with the vhost user socket vhuxxxxxxxx-xx. Kicking the other side comes at a cost. Firstly, a system call is needed to write to the socket. Secondly, an interrupt will have to be processed by the other side. Hence both sender and receiver spend costly extra time within the control channel.
In order to avoid costly kicks via the control plane, both Open vSwitch and qemu can set specific flags to signal to the other side that they do not wish to receive an interrupt. However, they can only do so if they either temporarily or constantly poll the virtio queue.
For instance network performance, this means that the optimal means of packet processing is DPDK within the instance itself. While Linux kernel networking (NAPI) uses a mix of interrupt and poll mode processing, it is still exposed to a high number of interrupts. OVS DPDK sends packets towards the instance at very high rates. At the same time, the Rx and Tx buffers of qemu's virtio queue are limited to a default of 256 and a maximum of 1024 entries. As a consequence, the instance itself needs to process packets very quickly. This is ideally achieved by constantly polling the instance's interface with a DPDK PMD.
The vhost user protocol
https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.txt
Vhost-user Protocol
===================
Copyright (c) 2014 Virtual Open Systems Sarl.
This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.
===================
This protocol is aiming to complement the ioctl interface used to control the
vhost implementation in the Linux kernel. It implements the control plane needed
to establish virtqueue sharing with a user space process on the same host. It
uses communication over a Unix domain socket to share file descriptors in the
ancillary data of the message.
The protocol defines 2 sides of the communication, master and slave. Master is
the application that shares its virtqueues, in our case QEMU. Slave is the
consumer of the virtqueues.
In the current implementation QEMU is the Master, and the Slave is intended to
be a software Ethernet switch running in user space, such as Snabbswitch.
Master and slave can be either a client (i.e. connecting) or server (listening)
in the socket communication.
vhost user hence has 2 sides:
- Master - qemu
- Slave - Open vSwitch or any other software switch
vhost user can run in 2 modes:
- vhostuser-client - qemu is the server, the software switch is the client
- vhostuser - the software switch is the server, qemu is the client
vhost user is based on the vhost architecture and implements all features in user space.
When a qemu instance boots, it will allocate all of the instance memory as shared hugepages. The OS' virtio paravirtualized driver will reserve part of this hugepage memory for holding the virtio ring buffer. This allows OVS DPDK to directly read from and write into the instance's virtio ring. Both OVS DPDK and qemu can directly exchange packets across this reserved memory section.
"The user space application will receive file descriptors for the pre-allocated shared guest RAM. It will directly access the related vrings in the guest's memory space" (http://www.virtualopensystems.com/en/solutions/guides/snabbswitch-qemu/).
For example, look at the following VM, mode vhostuser:
qemu 528828 0.1 0.0 2920084 34188 ? Sl Mar28 1:45 /usr/libexec/qemu-kvm -name guest=instance-00000028,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-58-instance-00000028/master-key.aes -machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,dump-guest-core=off -cpu Skylake-Client,ss=on,hypervisor=on,tsc_adjust=on,pdpe1gb=on,mpx=off,xsavec=off,xgetbv1=off -m 2048 -realtime mlock=off -smp 8,sockets=4,cores=1,threads=2 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/58-instance-00000028,share=yes,size=1073741824,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0-3,memdev=ram-node0 -object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/58-instance-00000028,share=yes,size=1073741824,host-nodes=1,policy=bind -numa node,nodeid=1,cpus=4-7,memdev=ram-node1 -uuid 48888226-7b6b-415c-bcf7-b278ba0bca62 -smbios type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.1.0-3.el7ost,serial=3d5e138a-8193-41e4-ac95-de9bfc1a3ef1,uuid=48888226-7b6b-415c-bcf7-b278ba0bca62,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-58-instance-00000028/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/48888226-7b6b-415c-bcf7-b278ba0bca62/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev socket,id=charnet0,path=/var/run/openvswitch/vhuc26fd3c6-4b -netdev vhost-user,chardev=charnet0,queues=8,id=hostnet0 -device virtio-net-pci,mq=on,vectors=18,netdev=hostnet0,id=net0,mac=fa:16:3e:52:30:73,bus=pci.0,addr=0x3 -add-fd set=0,fd=33 -chardev file,id=charserial0,path=/dev/fdset/0,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 172.16.2.10:1 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
Qemu is instructed to allocate memory from the huge page pool and to make it shared memory (share=yes):
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/58-instance-00000028,share=yes,size=1073741824,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0-3,memdev=ram-node0 -object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/58-instance-00000028,share=yes,size=1073741824,host-nodes=1,policy=bind
Simply copying packets into the other party's buffer is not enough, however. Additionally, vhost user uses a Unix domain socket (vhu[a-f0-9-]) for communication between the vswitch and qemu, both during initialization and to kick the other side when packets were copied into the virtio ring in shared memory. Interaction hence consists of a control path (the vhu socket) for setup and notification and a datapath (direct memory access) for moving the actual payload.
For the described Virtio mechanism to work, we need a setup interface to initialize the shared memory regions and exchange the event file descriptors. A Unix domain socket implements an API which allows us to do that. This straightforward socket interface can be used to initialize the userspace Virtio transport (vhost-user), in particular:
* Vrings are determined at initialization and are placed in shared memory between the two processes.
* For Virtio events (Vring kicks) we shall use eventfds that map to Vring events. This allows us compatibility with the QEMU/KVM implementation described in the next chapter, since KVM allows us to match events coming from virtio_pci in the guest with eventfds (ioeventfd and irqfd).
Sharing file descriptors between two processes differs than sharing them between a process and the kernel. One needs to use sendmsg over a Unix domain socket with SCM_RIGHTS set.
(http://www.virtualopensystems.com/en/solutions/guides/snabbswitch-qemu/)
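The fd passing mentioned in the quote can be sketched generically. This is a hedged example of SCM_RIGHTS over a Unix domain socket (the helper name send_fd is made up), not the actual qemu or DPDK implementation, but it is the same mechanism both use to hand memory and eventfd file descriptors to the other side:
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one file descriptor over a connected AF_UNIX socket. */
static int send_fd(int sock, int fd_to_send)
{
    char dummy = '\0';                       /* sendmsg needs at least 1 byte of data */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    memset(ctrl, 0, sizeof(ctrl));

    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = ctrl,
        .msg_controllen = sizeof(ctrl),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;            /* marks the payload as file descriptors */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}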
In vhostuser mode, OVS creates the vhu socket and qemu connects to it. In vhostuser-client mode, qemu creates the vhu socket and OVS connects to it.
In the above example instance with vhostuser mode, qemu is instructed to connect a netdev of type vhost-user to /var/run/openvswitch/vhuc26fd3c6-4b:
-chardev socket,id=charnet0,path=/var/run/openvswitch/vhuc26fd3c6-4b -netdev vhost-user,chardev=charnet0,queues=8,id=hostnet0 -device virtio-net-pci,mq=on,vectors=18,netdev=hostnet0,id=net0,mac=fa:16:3e:52:30:73,bus=pci.0,addr=0x3
lsof reveals that the socket is created by OVS:
[root@overcloud-compute-0 ~]# lsof -nn | grep vhuc26fd3c6-4b | awk '{print $1}' | uniq
ovs-vswit
vfio-sync
eal-intr-
lcore-sla
dpdk_watc
vhost_thr
ct_clean3
urcu4
handler12
handler13
handler14
handler15
revalidat
pmd189
pmd182
pmd187
pmd184
pmd185
pmd186
pmd183
pmd188
When a packet is copied into the virtio ring in shared memory by one of the participants, the other side either
- is polling the queue, either temporarily (e.g. the Linux kernel's NAPI) or constantly (e.g. DPDK's PMD), in which case it will pick up new packets without further notice, or
- does not poll the queue and must be notified of the arrival of packets.
For the second case, the instance can be kicked via the separate control path across the vhu socket. The control path implements interrupts in user space by exchanging eventfd objects. Note that writing to the socket requires system calls and will cause the PMDs to spend time in kernel space. The VM can switch off the control path by setting the VRING_AVAIL_F_NO_INTERRUPT flag. Otherwise, Open vSwitch will kick (interrupt) the VM whenever it puts new packets into the virtio ring.
Further details can be found in the following blog post: http://blog.vmsplice.net/2011/09/qemu-internals-vhost-architecture.html
Vhost as a userspace interface
One surprising aspect of the vhost architecture is that it is not tied to KVM in any way. Vhost is a userspace interface and has no dependency on the KVM kernel module. This means other userspace code, like libpcap, could in theory use vhost devices if they find them convenient high-performance I/O interfaces.
When a guest kicks the host because it has placed buffers onto a virtqueue, there needs to be a way to signal the vhost worker thread that there is work to do. Since vhost does not depend on the KVM kernel module they cannot communicate directly. Instead vhost instances are set up with an eventfd file descriptor which the vhost worker thread watches for activity. The KVM kernel module has a feature known as ioeventfd for taking an eventfd and hooking it up to a particular guest I/O exit. QEMU userspace registers an ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY hardware register access which kicks the virtqueue. This is how the vhost worker thread gets notified by the KVM kernel module when the guest kicks the virtqueue.
On the return trip from the vhost worker thread to interrupting the guest a similar approach is used. Vhost takes a "call" file descriptor which it will write to in order to kick the guest. The KVM kernel module has a feature called irqfd which allows an eventfd to trigger guest interrupts. QEMU userspace registers an irqfd for the virtio PCI device interrupt and hands it to the vhost instance. This is how the vhost worker thread can interrupt the guest.
In the end the vhost instance only knows about the guest memory mapping, a kick eventfd, and a call eventfd.
Where to find out more
Here are the main points to begin exploring the code:
drivers/vhost/vhost.c - common vhost driver code
drivers/vhost/net.c - vhost-net driver
virt/kvm/eventfd.c - ioeventfd and irqfd
The QEMU userspace code shows how to initialize the vhost instance:
hw/vhost.c - common vhost initialization code
hw/vhost_net.c - vhost-net initialization
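The kick/call mechanism described above boils down to plain eventfd objects. As a rough, self-contained illustration (not qemu or DPDK code), one side writes a counter value to the eventfd and the side holding the other copy of the descriptor wakes up from its read or poll:
#include <stdio.h>
#include <sys/eventfd.h>

int main(void)
{
    /* In vhost user, such descriptors are created by qemu and handed to the
     * backend via VHOST_USER_SET_VRING_KICK / VHOST_USER_SET_VRING_CALL. */
    int efd = eventfd(0, 0);
    if (efd < 0) {
        perror("eventfd");
        return 1;
    }

    /* "Kick": signal the other side that work is pending. */
    eventfd_write(efd, 1);

    /* The other side (normally another thread or process holding a copy of
     * the descriptor) consumes the notification. */
    eventfd_t value;
    eventfd_read(efd, &value);
    printf("received %llu kick(s)\n", (unsigned long long)value);
    return 0;
}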
The datapath - direct memory access
How memory is mapped for the virtq
The virtio standard defines exactly what a virtq should look like.
2.4 Virtqueues
The mechanism for bulk data transport on virtio devices is pretentiously called a virtqueue. Each device can have zero or more virtqueues. Each queue has a 16-bit queue size parameter, which sets the number of entries and implies the total size of the queue.
Each virtqueue consists of three parts:
Descriptor Table
Available Ring
Used Ring
http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html
The standard exactly defines the structure of the descriptor table, available ring and used ring. For example, for the available ring:
2.4.6 The Virtqueue Available Ring
struct virtq_avail {
#define VIRTQ_AVAIL_F_NO_INTERRUPT 1
le16 flags;
le16 idx;
le16 ring[ /* Queue Size */ ];
le16 used_event; /* Only if VIRTIO_F_EVENT_IDX */
};
The driver uses the available ring to offer buffers to the device: each ring entry refers to the head of a descriptor chain. It is only written by the driver and read by the device.
idx field indicates where the driver would put the next descriptor entry in the ring (modulo the queue size). This starts at 0, and increases. Note: The legacy [Virtio PCI Draft] referred to this structure as vring_avail, and the constant as VRING_AVAIL_F_NO_INTERRUPT, but the layout and value were identical.
http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html
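To make the spec concrete, the following hedged sketch shows what a driver-side "offer a buffer" operation looks like against these structures. The struct layouts follow the standard quoted above; the memory barrier is simplified to GCC's __sync_synchronize(), whereas real drivers use the appropriate virtio barriers:
#include <stdint.h>

struct virtq_desc  { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct virtq_avail { uint16_t flags; uint16_t idx; uint16_t ring[]; };

/* Offer one buffer (descriptor chain head 'head') to the device. */
static void avail_ring_add(struct virtq_avail *avail, uint16_t queue_size,
                           uint16_t head)
{
    /* Place the chain head into the next free slot (modulo the queue size). */
    avail->ring[avail->idx % queue_size] = head;

    /* Make sure the device sees the ring entry before the new index. */
    __sync_synchronize();

    /* Publishing the incremented idx is what makes the buffer visible. */
    avail->idx++;
}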
In order to make direct memory access possible, DPDK implements the above standard.
dpdk-stable-16.11.4/drivers/net/virtio/virtio_ring.h
48 /* The Host uses this in used->flags to advise the Guest: don't kick me
49 * when you add a buffer. It's unreliable, so it's simply an
50 * optimization. Guest will still kick if it's out of buffers. */
51 #define VRING_USED_F_NO_NOTIFY 1
52 /* The Guest uses this in avail->flags to advise the Host: don't
53 * interrupt me when you consume a buffer. It's unreliable, so it's
54 * simply an optimization. */
55 #define VRING_AVAIL_F_NO_INTERRUPT 1
56
57 /* VirtIO ring descriptors: 16 bytes.
58 * These can chain together via "next". */
59 struct vring_desc {
60 uint64_t addr; /* Address (guest-physical). */
61 uint32_t len; /* Length. */
62 uint16_t flags; /* The flags as indicated above. */
63 uint16_t next; /* We chain unused descriptors via this. */
64 };
65
66 struct vring_avail {
67 uint16_t flags;
68 uint16_t idx;
69 uint16_t ring[0];
70 };
71
72 /* id is a 16bit index. uint32_t is used here for ids for padding reasons. */
73 struct vring_used_elem {
74 /* Index of start of used descriptor chain. */
75 uint32_t id;
76 /* Total length of the descriptor chain which was written to. */
77 uint32_t len;
78 };
79
80 struct vring_used {
81 uint16_t flags;
82 volatile uint16_t idx;
83 struct vring_used_elem ring[0];
84 };
85
86 struct vring {
87 unsigned int num;
88 struct vring_desc *desc;
89 struct vring_avail *avail;
90 struct vring_used *used;
91 };
dpdk-stable-16.11.4/lib/librte_vhost/vhost.h
81 struct vhost_virtqueue {
82 struct vring_desc *desc;
83 struct vring_avail *avail;
84 struct vring_used *used;
85 uint32_t size;
86
87 uint16_t last_avail_idx;
88 uint16_t last_used_idx;
89 #define VIRTIO_INVALID_EVENTFD (-1)
90 #define VIRTIO_UNINITIALIZED_EVENTFD (-2)
91
92 /* Backend value to determine if device should started/stopped */
93 int backend;
94 /* Used to notify the guest (trigger interrupt) */
95 int callfd;
96 /* Currently unused as polling mode is enabled */
97 int kickfd;
98 int enabled;
99
100 /* Physical address of used ring, for logging */
101 uint64_t log_guest_addr;
102
103 uint16_t nr_zmbuf;
104 uint16_t zmbuf_size;
105 uint16_t last_zmbuf_idx;
106 struct zcopy_mbuf *zmbufs;
107 struct zcopy_mbuf_list zmbuf_list;
108
109 struct vring_used_elem *shadow_used_ring;
110 uint16_t shadow_used_idx;
111 } __rte_cache_aligned;
Once the memory mapping is done, DPDK can directly act on and manipulate the same structures as virtio-net within the guest's shared memory.
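As a hedged illustration of what that manipulation looks like, the sketch below consumes buffers the way a vhost backend conceptually does: it walks the available ring from its private last_avail_idx and reports completions through the used ring. It mirrors the struct vring / vring_used definitions quoted above, but omits descriptor chaining, address translation and the barriers needed in real code:
#include <stdint.h>

struct vring_desc      { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct vring_avail     { uint16_t flags; uint16_t idx; uint16_t ring[]; };
struct vring_used_elem { uint32_t id; uint32_t len; };
struct vring_used      { uint16_t flags; uint16_t idx; struct vring_used_elem ring[]; };

struct vring {
    unsigned int num;
    struct vring_desc *desc;
    struct vring_avail *avail;
    struct vring_used *used;
};

/* Consume everything the driver has made available since last time. */
static void consume_avail(struct vring *vr, uint16_t *last_avail_idx)
{
    while (*last_avail_idx != vr->avail->idx) {
        uint16_t head = vr->avail->ring[*last_avail_idx % vr->num];
        struct vring_desc *d = &vr->desc[head];

        /* ... copy d->len bytes from/to the guest buffer at d->addr ... */

        /* Tell the driver the chain starting at 'head' has been used. */
        vr->used->ring[vr->used->idx % vr->num] =
            (struct vring_used_elem){ .id = head, .len = d->len };
        vr->used->idx++;
        (*last_avail_idx)++;
    }
}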
The control path - Unix sockets
qemu and DPDK message exchange over vhost user socket
DPDK and qemu communicate via the standardized vhost-user protocol.
The message types are:
dpdk-stable-16.11.4/lib/librte_vhost/vhost_user.h
54 typedef enum VhostUserRequest {
55 VHOST_USER_NONE = 0,
56 VHOST_USER_GET_FEATURES = 1,
57 VHOST_USER_SET_FEATURES = 2,
58 VHOST_USER_SET_OWNER = 3,
59 VHOST_USER_RESET_OWNER = 4,
60 VHOST_USER_SET_MEM_TABLE = 5,
61 VHOST_USER_SET_LOG_BASE = 6,
62 VHOST_USER_SET_LOG_FD = 7,
63 VHOST_USER_SET_VRING_NUM = 8,
64 VHOST_USER_SET_VRING_ADDR = 9,
65 VHOST_USER_SET_VRING_BASE = 10,
66 VHOST_USER_GET_VRING_BASE = 11,
67 VHOST_USER_SET_VRING_KICK = 12,
68 VHOST_USER_SET_VRING_CALL = 13,
69 VHOST_USER_SET_VRING_ERR = 14,
70 VHOST_USER_GET_PROTOCOL_FEATURES = 15,
71 VHOST_USER_SET_PROTOCOL_FEATURES = 16,
72 VHOST_USER_GET_QUEUE_NUM = 17,
73 VHOST_USER_SET_VRING_ENABLE = 18,
74 VHOST_USER_SEND_RARP = 19,
75 VHOST_USER_MAX
76 } VhostUserRequest;
Further details about the message types can be found in qemu's source code in: https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.txt
DPDK processes incoming messages with …
dpdk-stable-16.11.4/lib/librte_vhost/vhost_user.c
920 int
921 vhost_user_msg_handler(int vid, int fd)
922 {
… which uses:
dpdk-stable-16.11.4/lib/librte_vhost/vhost_user.c:
872 /* return bytes# of read on success or negative val on failure. */
873 static int
874 read_vhost_message(int sockfd, struct VhostUserMsg *msg)
875 {
DPDK writes outgoing messages with:
dpdk-stable-16.11.4/lib/librte_vhost/vhost_user.c
902 static int
903 send_vhost_message(int sockfd, struct VhostUserMsg *msg)
904 {
qemu has an equivalent method for receiving:
qemu-2.9.0/contrib/libvhost-user/libvhost-user.c
746 static bool
747 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
And qemu obviously also has an equivalent method for sending:
qemu-2.9.0/hw/virtio/vhost-user.c
198 /* most non-init callers ignore the error */
199 static int vhost_user_write(struct vhost_dev *dev, VhostUserMsg *msg,
200 int *fds, int fd_num)
201 {
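The messages themselves share a simple framing: a fixed header (a 32-bit request, 32-bit flags and 32-bit payload size) followed by an optional payload, with any file descriptors carried as SCM_RIGHTS ancillary data. The following is a simplified, hedged sketch of such a receive path; the types and names are illustrative, not the actual VhostUserMsg definitions:
#include <stdint.h>
#include <unistd.h>

struct vhost_msg_hdr {
    uint32_t request;   /* one of the VHOST_USER_* values */
    uint32_t flags;
    uint32_t size;      /* payload size that follows the header */
};

/* Read one vhost-user message; returns payload bytes read or -1 on error. */
static int recv_vhost_msg(int sockfd, struct vhost_msg_hdr *hdr,
                          void *payload, size_t payload_max)
{
    /* Real implementations use recvmsg() here so that file descriptors in
     * the ancillary data (SCM_RIGHTS) are picked up along with the header. */
    if (read(sockfd, hdr, sizeof(*hdr)) != (ssize_t)sizeof(*hdr))
        return -1;
    if (hdr->size > payload_max)
        return -1;
    if (hdr->size && read(sockfd, payload, hdr->size) != (ssize_t)hdr->size)
        return -1;
    return (int)hdr->size;
}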
How DPDK registers the Unix socket and uses it for message exchange
neutron instructs Open vSwitch to create a port with name vhuxxxxxxxx-xx. Within OVS, this name is saved in the netdev structure as netdev->name.
When it creates the vhost user port, Open vSwitch instructs DPDK to register a new vhost-user socket. The socket's path is set as dev->vhost_id, which is a concatenation of vhost_sock_dir and netdev->name.
OVS can request to create the socket in vhost user client mode by passing the RTE_VHOST_USER_CLIENT flag.
OVS' netdev_dpdk_vhost_construct method calls DPDK's rte_vhost_driver_register method, which in turn executes vhost_user_create_server or vhost_user_create_client. By default, vhost user server mode is used; if RTE_VHOST_USER_CLIENT is set, vhost user client mode is used.
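Outside of OVS, any DPDK application can drive the same entry point. The following is a hedged, minimal sketch (EAL setup and error handling omitted, the wrapper name is made up) of registering a vhost-user socket in server mode or, with RTE_VHOST_USER_CLIENT, in client mode:
#include <stdint.h>
#include <rte_virtio_net.h>   /* vhost user API in DPDK 16.11 */

/* Register a vhost-user socket; flags = 0 means server mode (the default),
 * RTE_VHOST_USER_CLIENT makes DPDK connect to a socket created by qemu. */
static int register_vhost_port(const char *sock_path, int client_mode)
{
    uint64_t flags = client_mode ? RTE_VHOST_USER_CLIENT : 0;

    return rte_vhost_driver_register(sock_path, flags);
}
OVS effectively does the same thing, passing dev->vhost_id as the socket path.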
Overview of the involved methods:
OVS
netdev_dpdk_vhost_construct
(struct netdev *netdev)
|
|
DPDK V
rte_vhost_driver_register
(const char *path, uint64_t flags)
|
-----------------------------------------------
| |
V |
vhost_user_create_server |
(struct vhost_user_socket *vsocket) |
| |
V V
vhost_user_server_new_connection vhost_user_create_client vhost_user_client_reconnect
(int fd, void *dat, int *remove __rte_unused) (struct vhost_user_socket *vsocket) (void *arg __rte_unused)
| | |
V V V
--------------------------------------------------------------------------------------------------
|
V
vhost_user_add_connection
(int fd, struct vhost_user_socket *vsocket)
|
V
vhost_user_read_cb
(int connfd, void *dat, int *remove)
|
V
vhost_user_msg_handler
netdev_dpdk_vhost_construct is in openvswitch-2.6.1/lib/netdev-dpdk.c:
886 static int
887 netdev_dpdk_vhost_construct(struct netdev *netdev)
888 {
889 struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
890 const char *name = netdev->name;
891 int err;
892
893 /* 'name' is appended to 'vhost_sock_dir' and used to create a socket in
894 * the file system. '/' or '\' would traverse directories, so they're not
895 * acceptable in 'name'. */
896 if (strchr(name, '/') || strchr(name, '\\')) {
897 VLOG_ERR("\"%s\" is not a valid name for a vhost-user port. "
898 "A valid name must not include '/' or '\\'",
899 name);
900 return EINVAL;
901 }
902
903 if (rte_eal_init_ret) {
904 return rte_eal_init_ret;
905 }
906
907 ovs_mutex_lock(&dpdk_mutex);
908 /* Take the name of the vhost-user port and append it to the location where
909 * the socket is to be created, then register the socket.
910 */
911 snprintf(dev->vhost_id, sizeof dev->vhost_id, "%s/%s",
912 vhost_sock_dir, name);
913
914 dev->vhost_driver_flags &= ~RTE_VHOST_USER_CLIENT;
915 err = rte_vhost_driver_register(dev->vhost_id, dev->vhost_driver_flags);
916 if (err) {
917 VLOG_ERR("vhost-user socket device setup failure for socket %s\n",
918 dev->vhost_id);
919 } else {
920 fatal_signal_add_file_to_unlink(dev->vhost_id);
921 VLOG_INFO("Socket %s created for vhost-user port %s\n",
922 dev->vhost_id, name);
923 }
924 err = netdev_dpdk_init(netdev, -1, DPDK_DEV_VHOST);
925
926 ovs_mutex_unlock(&dpdk_mutex);
927 return err;
928 }
netdev_dpdk_vhost_construct calls rte_vhost_driver_register. All of the following code is in dpdk-stable-16.11.4/lib/librte_vhost/socket.c:
494 /*
495 * Register a new vhost-user socket; here we could act as server
496 * (the default case), or client (when RTE_VHOST_USER_CLIENT) flag
497 * is set.
498 */
499 int
500 rte_vhost_driver_register(const char *path, uint64_t flags)
501 {
(...)
525 if ((flags & RTE_VHOST_USER_CLIENT) != 0) {
526 vsocket->reconnect = !(flags & RTE_VHOST_USER_NO_RECONNECT);
527 if (vsocket->reconnect && reconn_tid == 0) {
528 if (vhost_user_reconnect_init() < 0) {
529 free(vsocket->path);
530 free(vsocket);
531 goto out;
532 }
533 }
534 ret = vhost_user_create_client(vsocket);
535 } else {
536 vsocket->is_server = true;
537 ret = vhost_user_create_server(vsocket);
538 }
vhost_user_create_server calls vhost_user_server_new_connection:
304 static int
305 vhost_user_create_server(struct vhost_user_socket *vsocket)
306 {
307 int fd;
308 int ret;
309 struct sockaddr_un un;
310 const char *path = vsocket->path;
311
312 fd = create_unix_socket(path, &un, vsocket->is_server);
And any of the 3 following methods calls vhost_user_add_connection:
239 /* call back when there is new vhost-user connection from client */
240 static void
241 vhost_user_server_new_connection(int fd, void *dat, int *remove __rte_unused)
242 {
(...)
386 static void *
387 vhost_user_client_reconnect(void *arg __rte_unused)
388 {
(...)
447 static int
448 vhost_user_create_client(struct vhost_user_socket *vsocket)
449 {
(...)
190 static void
191 vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
192 {
vhost_user_add_connection then executes vhost_user_read_cb, which in turn runs vhost_user_msg_handler for incoming message handling.
253 static void
254 vhost_user_read_cb(int connfd, void *dat, int *remove)
255 {
256 struct vhost_user_connection *conn = dat;
257 struct vhost_user_socket *vsocket = conn->vsocket;
258 int ret;
259
260 ret = vhost_user_msg_handler(conn->vid, connfd);
261 if (ret < 0) {
262 close(connfd);
263 *remove = 1;
264 vhost_destroy_device(conn->vid);
265
266 pthread_mutex_lock(&vsocket->conn_mutex);
267 TAILQ_REMOVE(&vsocket->conn_list, conn, next);
268 pthread_mutex_unlock(&vsocket->conn_mutex);
269
270 free(conn);
271
272 if (vsocket->reconnect)
273 vhost_user_create_client(vsocket);
274 }
275 }
dpdk-stable-16.11.4/lib/librte_vhost/vhost_user.c
920 int
921 vhost_user_msg_handler(int vid, int fd)
922 {
923 struct virtio_net *dev;
924 struct VhostUserMsg msg;
925 int ret;
926
927 dev = get_device(vid);
928 if (dev == NULL)
929 return -1;
930
931 ret = read_vhost_message(fd, &msg);
932 if (ret <= 0 || msg.request >= VHOST_USER_MAX) {
933 if (ret < 0)
934 RTE_LOG(ERR, VHOST_CONFIG,
935 "vhost read message failed\n");
936 else if (ret == 0)
937 RTE_LOG(INFO, VHOST_CONFIG,
938 "vhost peer closed\n");
939 else
940 RTE_LOG(ERR, VHOST_CONFIG,
941 "vhost read incorrect message\n");
942
943 return -1;
944 }
945
946 RTE_LOG(INFO, VHOST_CONFIG, "read message %s\n",
947 vhost_message_str[msg.request]);
948 switch (msg.request) {
949 case VHOST_USER_GET_FEATURES:
950 msg.payload.u64 = vhost_user_get_features();
951 msg.size = sizeof(msg.payload.u64);
952 send_vhost_message(fd, &msg);
953 break;
954 case VHOST_USER_SET_FEATURES:
955 vhost_user_set_features(dev, msg.payload.u64);
956 break;
957
958 case VHOST_USER_GET_PROTOCOL_FEATURES:
959 msg.payload.u64 = VHOST_USER_PROTOCOL_FEATURES;
960 msg.size = sizeof(msg.payload.u64);
961 send_vhost_message(fd, &msg);
962 break;
963 case VHOST_USER_SET_PROTOCOL_FEATURES:
964 vhost_user_set_protocol_features(dev, msg.payload.u64);
965 break;
966
967 case VHOST_USER_SET_OWNER:
968 vhost_user_set_owner();
969 break;
970 case VHOST_USER_RESET_OWNER:
971 vhost_user_reset_owner(dev);
972 break;
973
974 case VHOST_USER_SET_MEM_TABLE:
975 vhost_user_set_mem_table(dev, &msg);
976 break;
977
978 case VHOST_USER_SET_LOG_BASE:
979 vhost_user_set_log_base(dev, &msg);
980
981 /* it needs a reply */
982 msg.size = sizeof(msg.payload.u64);
983 send_vhost_message(fd, &msg);
984 break;
985 case VHOST_USER_SET_LOG_FD:
986 close(msg.fds[0]);
987 RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
988 break;
989
990 case VHOST_USER_SET_VRING_NUM:
991 vhost_user_set_vring_num(dev, &msg.payload.state);
992 break;
993 case VHOST_USER_SET_VRING_ADDR:
994 vhost_user_set_vring_addr(dev, &msg.payload.addr);
995 break;
996 case VHOST_USER_SET_VRING_BASE:
997 vhost_user_set_vring_base(dev, &msg.payload.state);
998 break;
999
1000 case VHOST_USER_GET_VRING_BASE:
1001 ret = vhost_user_get_vring_base(dev, &msg.payload.state);
1002 msg.size = sizeof(msg.payload.state);
1003 send_vhost_message(fd, &msg);
1004 break;
1005
1006 case VHOST_USER_SET_VRING_KICK:
1007 vhost_user_set_vring_kick(dev, &msg);
1008 break;
1009 case VHOST_USER_SET_VRING_CALL:
1010 vhost_user_set_vring_call(dev, &msg);
1011 break;
1012
1013 case VHOST_USER_SET_VRING_ERR:
1014 if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK))
1015 close(msg.fds[0]);
1016 RTE_LOG(INFO, VHOST_CONFIG, "not implemented\n");
1017 break;
1018
1019 case VHOST_USER_GET_QUEUE_NUM:
1020 msg.payload.u64 = VHOST_MAX_QUEUE_PAIRS;
1021 msg.size = sizeof(msg.payload.u64);
1022 send_vhost_message(fd, &msg);
1023 break;
1024
1025 case VHOST_USER_SET_VRING_ENABLE:
1026 vhost_user_set_vring_enable(dev, &msg.payload.state);
1027 break;
1028 case VHOST_USER_SEND_RARP:
1029 vhost_user_send_rarp(dev, &msg);
1030 break;
1031
1032 default:
1033 break;
1034
1035 }
1036
1037 return 0;
1038 }
How virtio communicates the virtio queue’s memory addresses to DPDK
DPDK uses a method called vhost_user_set_vring_addr to convert virtio's desc, used and avail ring addresses to its own address space.
dpdk-stable-16.11.4/lib/librte_vhost/vhost_user.c
324 /*
325 * The virtio device sends us the desc, used and avail ring addresses.
326 * This function then converts these to our address space.
327 */
328 static int
329 vhost_user_set_vring_addr(struct virtio_net *dev, struct vhost_vring_addr *addr)
330 {
331 struct vhost_virtqueue *vq;
332
333 if (dev->mem == NULL)
334 return -1;
335
336 /* addr->index refers to the queue index. The txq 1, rxq is 0. */
337 vq = dev->virtqueue[addr->index];
338
339 /* The addresses are converted from QEMU virtual to Vhost virtual. */
340 vq->desc = (struct vring_desc *)(uintptr_t)qva_to_vva(dev,
341 addr->desc_user_addr);
342 if (vq->desc == 0) {
343 RTE_LOG(ERR, VHOST_CONFIG,
344 "(%d) failed to find desc ring address.\n",
345 dev->vid);
346 return -1;
347 }
348
349 dev = numa_realloc(dev, addr->index);
350 vq = dev->virtqueue[addr->index];
351
352 vq->avail = (struct vring_avail *)(uintptr_t)qva_to_vva(dev,
353 addr->avail_user_addr);
354 if (vq->avail == 0) {
355 RTE_LOG(ERR, VHOST_CONFIG,
356 "(%d) failed to find avail ring address.\n",
357 dev->vid);
358 return -1;
359 }
360
361 vq->used = (struct vring_used *)(uintptr_t)qva_to_vva(dev,
362 addr->used_user_addr);
363 if (vq->used == 0) {
364 RTE_LOG(ERR, VHOST_CONFIG,
365 "(%d) failed to find used ring address.\n",
366 dev->vid);
367 return -1;
368 }
369
370 if (vq->last_used_idx != vq->used->idx) {
371 RTE_LOG(WARNING, VHOST_CONFIG,
372 "last_used_idx (%u) and vq->used->idx (%u) mismatches; "
373 "some packets maybe resent for Tx and dropped for Rx\n",
374 vq->last_used_idx, vq->used->idx);
375 vq->last_used_idx = vq->used->idx;
376 vq->last_avail_idx = vq->used->idx;
377 }
378
379 vq->log_guest_addr = addr->log_guest_addr;
380
381 LOG_DEBUG(VHOST_CONFIG, "(%d) mapped address desc: %p\n",
382 dev->vid, vq->desc);
383 LOG_DEBUG(VHOST_CONFIG, "(%d) mapped address avail: %p\n",
384 dev->vid, vq->avail);
385 LOG_DEBUG(VHOST_CONFIG, "(%d) mapped address used: %p\n",
386 dev->vid, vq->used);
387 LOG_DEBUG(VHOST_CONFIG, "(%d) log_guest_addr: %" PRIx64 "\n",
388 dev->vid, vq->log_guest_addr);
389
390 return 0;
391 }
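The qva_to_vva() calls above translate qemu virtual addresses into addresses valid in the vhost process. Conceptually, this is a lookup in the memory region table previously announced with VHOST_USER_SET_MEM_TABLE and mmap'ed by DPDK. A hedged, simplified sketch of that idea (field and function names are illustrative, not DPDK's actual structures):
#include <stdint.h>

struct mem_region_sketch {
    uint64_t qemu_user_addr;   /* start of the region in qemu's address space */
    uint64_t size;
    uint64_t mmap_addr;        /* where the vhost process mapped the region */
};

/* Translate a qemu virtual address to a vhost (local) virtual address. */
static uint64_t qva_to_vva_sketch(const struct mem_region_sketch *regions,
                                  unsigned int nregions, uint64_t qemu_va)
{
    for (unsigned int i = 0; i < nregions; i++) {
        const struct mem_region_sketch *r = &regions[i];

        if (qemu_va >= r->qemu_user_addr &&
            qemu_va <  r->qemu_user_addr + r->size)
            return r->mmap_addr + (qemu_va - r->qemu_user_addr);
    }
    return 0;   /* not found: the caller treats this as an error */
}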
The memory will be set if a message of type VHOST_USER_SET_VRING_ADDR arrives via the vhu socket:
dpdk-stable-16.11.4/lib/librte_vhost/vhost_user.c
920 int
921 vhost_user_msg_handler(int vid, int fd)
922 {
923 struct virtio_net *dev;
924 struct VhostUserMsg msg;
925 int ret;
926
927 dev = get_device(vid);
928 if (dev == NULL)
929 return -1;
930
931 ret = read_vhost_message(fd, &msg);
932 if (ret <= 0 || msg.request >= VHOST_USER_MAX) {
933 if (ret < 0)
934 RTE_LOG(ERR, VHOST_CONFIG,
935 "vhost read message failed\n");
936 else if (ret == 0)
937 RTE_LOG(INFO, VHOST_CONFIG,
938 "vhost peer closed\n");
939 else
940 RTE_LOG(ERR, VHOST_CONFIG,
941 "vhost read incorrect message\n");
942
943 return -1;
944 }
945
946 RTE_LOG(INFO, VHOST_CONFIG, "read message %s\n",
947 vhost_message_str[msg.request]);
948 switch (msg.request) {
949 case VHOST_USER_GET_FEATURES:
950 msg.payload.u64 = vhost_user_get_features();
951 msg.size = sizeof(msg.payload.u64);
952 send_vhost_message(fd, &msg);
953 break;
(...)
993 case VHOST_USER_SET_VRING_ADDR:
994 vhost_user_set_vring_addr(dev, &msg.payload.addr);
995 break;
As a matter of fact, qemu has a method equivalent to that DPDK method:
qemu-2.9.0/contrib/libvhost-user/libvhost-user.c
746 static bool
747 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
748 {
749 int do_reply = 0;
750
751 /* Print out generic part of the request. */
752 DPRINT("================ Vhost user message ================\n");
753 DPRINT("Request: %s (%d)\n", vu_request_to_string(vmsg->request),
754 vmsg->request);
755 DPRINT("Flags: 0x%x\n", vmsg->flags);
756 DPRINT("Size: %d\n", vmsg->size);
757
758 if (vmsg->fd_num) {
759 int i;
760 DPRINT("Fds:");
761 for (i = 0; i < vmsg->fd_num; i++) {
762 DPRINT(" %d", vmsg->fds[i]);
763 }
764 DPRINT("\n");
765 }
766
767 if (dev->iface->process_msg &&
768 dev->iface->process_msg(dev, vmsg, &do_reply)) {
769 return do_reply;
770 }
771
772 switch (vmsg->request) {
773 case VHOST_USER_GET_FEATURES:
774 return vu_get_features_exec(dev, vmsg);
(...)
793 case VHOST_USER_SET_VRING_ADDR:
794 return vu_set_vring_addr_exec(dev, vmsg);
(...)
Obviously, it also has a method to communicate that address via the socket to DPDK, in qemu-2.9.0/hw/virtio/vhost-user.c:
329 static int vhost_user_set_vring_addr(struct vhost_dev *dev,
330 struct vhost_vring_addr *addr)
331 {
332 VhostUserMsg msg = {
333 .request = VHOST_USER_SET_VRING_ADDR,
334 .flags = VHOST_USER_VERSION,
335 .payload.addr = *addr,
336 .size = sizeof(msg.payload.addr),
337 };
338
339 if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
340 return -1;
341 }
342
343 return 0;
344 }
How does OVS DPDK with vhost user transmit packets into an instance, and when do Tx drops occur?
The code that handles Tx from OVS DPDK to the instance is the __netdev_dpdk_vhost_send() function in lib/netdev-dpdk.c.
OVS tries to send and, if there is no more space but progress was made, it retries up to VHOST_ENQ_RETRY_NUM (by default 8) times. If there was no progress on the first try (no packets pushed to the ring), or if it exceeds VHOST_ENQ_RETRY_NUM retries, then it drops all remaining packets in the batch (up to cnt, which is a maximum of 32 packets).
1520 do {
1523
1524 tx_pkts = rte_vhost_enqueue_burst(netdev_dpdk_get_vid(dev),
1525 vhost_qid, cur_pkts, cnt);
1526 if (OVS_LIKELY(tx_pkts)) {
1527 /* Packets have been sent.*/
1528 cnt -= tx_pkts;
1529 /* Prepare for possible retry.*/
1530 cur_pkts = &cur_pkts[tx_pkts];
1531 } else {
1532 /* No packets sent - do not retry.*/
1533 break;
1534 }
1535 } while (cnt && (retries++ <= VHOST_ENQ_RETRY_NUM));
1536
1545 for (i = 0; i < total_pkts - dropped; i++) {
1546 dp_packet_delete(pkts[i]);
1547 }
Instance Rx interrupt handling
When OVS DPDK puts new data into virtio's ring, two possible scenarios exist:
- the instance is not polling its queues and hence needs to be made aware that new packets arrived
- the instance is currently polling and hence there is no need to tell it that new data is in the ring
Within an instance that uses Linux kernel networking, the Linux Networking stack uses NAPI which is a mix of interrupt and polling modes. The guest OS starts in interrupt mode, so it does nothing until the first interrupt comes in. When that happens, the CPU quickly ACKs the IRQ and schedules the ksoftirqd thread to run the callback. It will also disable any further IRQs.
When ksoftirqd runs, it will try to poll as many packets as possible, up to the netdev_budget. If there are more packets in the queue, then ksoftirqd will reschedule itself and repeat the operation until no more packets are available. Note that it is polling and there is no need for further interrupts. When no more packets are available, ksoftirqd will stop polling and re-enable the IRQ to wait for the next packet/interrupt.
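This interplay of interrupts and budgeted polling can be sketched in plain C. The following is only an illustrative model of the NAPI logic described above, not kernel code; the queue and the budget value are simulated:
#include <stdbool.h>
#include <stdio.h>

static int rx_queue_len = 700;          /* pretend 700 packets are waiting */
static bool irq_enabled = false;

static bool fetch_packet(void)          /* stand-in for pulling one packet */
{
    if (rx_queue_len == 0)
        return false;
    rx_queue_len--;
    return true;
}

/* One ksoftirqd pass: poll at most 'budget' packets (cf. netdev_budget). */
static void napi_style_poll(int budget)
{
    int done = 0;

    while (done < budget && fetch_packet())
        done++;

    if (done < budget) {
        /* Queue drained below budget: re-enable the IRQ and stop polling. */
        irq_enabled = true;
    }
    /* else: budget exhausted, stay in polling mode and get rescheduled. */
}

int main(void)
{
    int passes = 0;

    while (!irq_enabled) {              /* repeat until the queue is drained */
        napi_style_poll(300);
        passes++;
    }
    printf("drained the queue in %d polling passes\n", passes);
    return 0;
}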
When the instance is polling, CPU caches are hot, there is little extra latency or delay, and the right processes in the host and in the VM are already running, further reducing latency. Additionally, in order for the host to send an IRQ to the guest, it needs to write to the Unix socket (a system call), which is expensive and adds further delay and cost.
The advantage of running DPDK within the instance as part of the NFV application lies in how PMDs handle traffic: because PMDs are constantly polling, they can keep the interrupt switched off and hence OVS DPDK does not have to kick the VM any more. This saves OVS DPDK from having to write to the socket and hence spares it from having to execute a write system call to the kernel. This is a win-win situation: OVS DPDK's PMD remains in user space and instances can process packets faster because they skip the extra overhead of being kicked through the control plane and of interrupt handling.
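Inside the guest, the corresponding DPDK receive path is a busy loop around rte_eth_rx_burst(). The following hedged fragment assumes the virtio port has already been configured and started by the usual EAL/ethdev setup (omitted here); it only shows the polling loop itself:
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Busy-poll one rx queue of an already configured virtio-net port. */
static void guest_pmd_poll_loop(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Non-blocking: returns immediately, possibly with 0 packets.
         * Because the PMD never sleeps, the virtio driver can keep
         * interrupts disabled and the host never has to kick the guest. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* ... process the packet ... */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}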
The guest can receive an interrupt if the VRING_AVAIL_F_NO_INTERRUPT flag is not set. This interrupt is transmitted to the guest via callfd and the OS' eventfd object.
The guest OS can hence enable or disable interrupts. When the guest disables interrupts on the virtio interface, virtio-net translates this by setting the VRING_AVAIL_F_NO_INTERRUPT flag, which is defined in both DPDK and qemu:
[root@overcloud-compute-0 SOURCES]# grep VRING_AVAIL_F_NO_INTERRUPT -R | grep def
dpdk-stable-16.11.5/drivers/net/virtio/virtio_ring.h:#define VRING_AVAIL_F_NO_INTERRUPT 1
dpdk-stable-17.11/drivers/net/virtio/virtio_ring.h:#define VRING_AVAIL_F_NO_INTERRUPT 1
[root@overcloud-compute-0 SOURCES]#
[root@overcloud-compute-0 qemu]# grep AVAIL_F_NO_INTERRUPT -R -i | grep def
qemu-2.9.0/roms/SLOF/lib/libvirtio/virtio.h:#define VRING_AVAIL_F_NO_INTERRUPT 1
qemu-2.9.0/roms/ipxe/src/include/ipxe/virtio-ring.h:#define VRING_AVAIL_F_NO_INTERRUPT 1
qemu-2.9.0/roms/seabios/src/hw/virtio-ring.h:#define VRING_AVAIL_F_NO_INTERRUPT 1
qemu-2.9.0/include/standard-headers/linux/virtio_ring.h:#define VRING_AVAIL_F_NO_INTERRUPT 1
Once the bit for VRING_AVAIL_F_NO_INTERRUPT in vq->avail->flags is set, this will instruct DPDK not to kick the instance. From dpdk-stable-16.11.4/lib/librte_vhost/virtio_net.c:
(...)
/* Kick the guest if necessary. */
if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
&& (vq->callfd >= 0))
eventfd_write(vq->callfd, (eventfd_t)1);
return count;
(...)
Note that, as explained earlier, this also saves the PMD from having to execute a system call.
A more detailed look at the code - OVS DPDK Tx towards the instance
Further investigation can be done in the code in lib/netdev-dpdk.c:
1487 static void
1488 __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
1489 struct dp_packet **pkts, int cnt)
1490 {
1491 struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
1492 struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
1493 unsigned int total_pkts = cnt;
1494 unsigned int dropped = 0;
1495 int i, retries = 0;
1496
1497 qid = dev->tx_q[qid % netdev->n_txq].map;
1498
1499 if (OVS_UNLIKELY(!is_vhost_running(dev) || qid < 0
1500 || !(dev->flags & NETDEV_UP))) {
1501 rte_spinlock_lock(&dev->stats_lock);
1502 dev->stats.tx_dropped+= cnt;
1503 rte_spinlock_unlock(&dev->stats_lock);
1504 goto out;
1505 }
1506
1507 rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
1508
1509 cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
1510 /* Check has QoS has been configured for the netdev */
1511 cnt = netdev_dpdk_qos_run__(dev, cur_pkts, cnt);
1512 dropped = total_pkts - cnt;
1513
1514 do {
1515 int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
1516 unsigned int tx_pkts;
1517
1518 tx_pkts = rte_vhost_enqueue_burst(netdev_dpdk_get_vid(dev),
1519 vhost_qid, cur_pkts, cnt);
1520 if (OVS_LIKELY(tx_pkts)) {
1521 /* Packets have been sent.*/
1522 cnt -= tx_pkts;
1523 /* Prepare for possible retry.*/
1524 cur_pkts = &cur_pkts[tx_pkts];
1525 } else {
1526 /* No packets sent - do not retry.*/
1527 break;
1528 }
1529 } while (cnt && (retries++ <= VHOST_ENQ_RETRY_NUM));
1530
1531 rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
1532
1533 rte_spinlock_lock(&dev->stats_lock);
1534 netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts,
1535 cnt + dropped);
1536 rte_spinlock_unlock(&dev->stats_lock);
1537
1538 out:
1539 for (i = 0; i < total_pkts - dropped; i++) {
1540 dp_packet_delete(pkts[i]);
1541 }
1542 }
The work is executed here:
do {
int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
unsigned int tx_pkts;
tx_pkts = rte_vhost_enqueue_burst(netdev_dpdk_get_vid(dev),
vhost_qid, cur_pkts, cnt);
if (OVS_LIKELY(tx_pkts)) {
/* Packets have been sent.*/
cnt -= tx_pkts;
/* Prepare for possible retry.*/
cur_pkts = &cur_pkts[tx_pkts];
} else {
/* No packets sent - do not retry.*/
break;
}
} while (cnt && (retries++ <= VHOST_ENQ_RETRY_NUM));
The rte_vhost_enqueue_burst method comes directly from DPDK's vhost library:
[root@overcloud-compute-0 src]# grep rte_vhost_enqueue_burst dpdk-stable-16.11.4/ -R
dpdk-stable-16.11.4/doc/guides/prog_guide/vhost_lib.rst:* ``rte_vhost_enqueue_burst(vid, queue_id, pkts, count)``
dpdk-stable-16.11.4/doc/guides/rel_notes/release_16_07.rst:* The function ``rte_vhost_enqueue_burst`` no longer supports concurrent enqueuing
dpdk-stable-16.11.4/drivers/net/vhost/rte_eth_vhost.c: nb_tx = rte_vhost_enqueue_burst(r->vid,
dpdk-stable-16.11.4/examples/tep_termination/vxlan_setup.c: ret = rte_vhost_enqueue_burst(vid, VIRTIO_RXQ, pkts_valid, count);
dpdk-stable-16.11.4/examples/vhost/main.c: ret = rte_vhost_enqueue_burst(dst_vdev->vid, VIRTIO_RXQ, &m, 1);
dpdk-stable-16.11.4/examples/vhost/main.c: enqueue_count = rte_vhost_enqueue_burst(vdev->vid, VIRTIO_RXQ,
dpdk-stable-16.11.4/lib/librte_vhost/rte_vhost_version.map: rte_vhost_enqueue_burst;
dpdk-stable-16.11.4/lib/librte_vhost/rte_virtio_net.h:uint16_t rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
dpdk-stable-16.11.4/lib/librte_vhost/virtio_net.c:rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
dpdk-stable-16.11.4/lib/librte_vhost/rte_virtio_net.h
/**
* This function adds buffers to the virtio devices RX virtqueue. Buffers can
* be received from the physical port or from another virtual device. A packet
* count is returned to indicate the number of packets that were succesfully
* added to the RX queue.
* @param vid
* virtio-net device ID
* @param queue_id
* virtio queue index in mq case
* @param pkts
* array to contain packets to be enqueued
* @param count
* packets num to be enqueued
* @return
* num of packets enqueued
*/
uint16_t rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
struct rte_mbuf **pkts, uint16_t count);
dpdk-stable-16.11.4/lib/librte_vhost/virtio_net.c
uint16_t
rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
struct rte_mbuf **pkts, uint16_t count)
{
struct virtio_net *dev = get_device(vid);
if (!dev)
return 0;
if (dev->features & (1 << VIRTIO_NET_F_MRG_RXBUF))
return virtio_dev_merge_rx(dev, queue_id, pkts, count);
else
return virtio_dev_rx(dev, queue_id, pkts, count);
}
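A hedged usage sketch, modeled on the pattern visible in examples/vhost/main.c from the grep output above: packets received from elsewhere are pushed into the guest's RX virtqueue, and anything the ring could not absorb has to be freed (or retried) by the caller, which is exactly what the OVS retry/drop logic shown earlier is about. The wrapper name is made up:
#include <rte_mbuf.h>
#include <rte_virtio_net.h>   /* rte_vhost_enqueue_burst(), VIRTIO_RXQ */

/* Push a batch of mbufs into the guest; free whatever did not fit. */
static void send_to_guest(int vid, struct rte_mbuf **pkts, uint16_t count)
{
    uint16_t enqueued = rte_vhost_enqueue_burst(vid, VIRTIO_RXQ, pkts, count);

    /* The ring was full (or nearly full): the remaining packets are dropped
     * here; OVS instead retries up to VHOST_ENQ_RETRY_NUM times first. */
    for (uint16_t i = enqueued; i < count; i++)
        rte_pktmbuf_free(pkts[i]);
}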
Both of these methods will transmit packets into the instance and will notify the instance via an interrupt (write) if necessary:
dpdk-stable-16.11.4/lib/librte_vhost/virtio_net.c
/**
* This function adds buffers to the virtio devices RX virtqueue. Buffers can
* be received from the physical port or from another virtio device. A packet
* count is returned to indicate the number of packets that are succesfully
* added to the RX queue. This function works when the mbuf is scattered, but
* it doesn't support the mergeable feature.
*/
static inline uint32_t __attribute__((always_inline))
virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
struct rte_mbuf **pkts, uint32_t count)
(...)
/* Kick the guest if necessary. */
if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
&& (vq->callfd >= 0))
eventfd_write(vq->callfd, (eventfd_t)1);
return count;
(...)
static inline uint32_t __attribute__((always_inline))
virtio_dev_merge_rx(struct virtio_net *dev, uint16_t queue_id,
struct rte_mbuf **pkts, uint32_t count)
(...)
From the above method virtio_dev_rx:
319 avail_idx = *((volatile uint16_t *)&vq->avail->idx);
320 start_idx = vq->last_used_idx;
321 free_entries = avail_idx - start_idx;
322 count = RTE_MIN(count, free_entries);
323 count = RTE_MIN(count, (uint32_t)MAX_PKT_BURST);
324 if (count == 0)
325 return 0;
The number of packets to be transmitted is hence set to MIN( [number of free entries], [MAX_PKT_BURST], [number of packets passed from the calling method] ). For example, if avail_idx is 300, last_used_idx is 290 and the caller passed a batch of 32 packets, only 10 packets will be enqueued in this call.
Another unlikely case will set the count to a lower value and will break the for loop:
344 for (i = 0; i < count; i++) {
345 uint16_t desc_idx = desc_indexes[i];
346 int err;
347
348 if (vq->desc[desc_idx].flags & VRING_DESC_F_INDIRECT) {
349 descs = (struct vring_desc *)(uintptr_t)gpa_to_vva(dev,
350 vq->desc[desc_idx].addr);
351 if (unlikely(!descs)) {
352 count = i;
353 break;
354 }
In the end, the used index is increased by count:
378 *(volatile uint16_t *)&vq->used->idx += count;
379 vq->last_used_idx += count;
380 vhost_log_used_vring(dev, vq,
381 offsetof(struct vring_used, idx),
382 sizeof(vq->used->idx));
Data is copied into the instance's memory via:
48 #include "vhost.h"
(...)
363 err = copy_mbuf_to_desc(dev, descs, pkts[i], desc_idx, sz);
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.