ixgbe vs ixgbevf Performance Differences


When deploying into AWS and wishing to make use of optimized networking (10Gbps), the AWS documentation states that the ixgbevf driver must be used. However, the ixgbevf drivers bundled with the EL6 and EL7 kernels are not compatible with the AWS SR-IOV implementation. One can use third-party ixgbevf drivers, but that is painful to maintain. The other option is to ignore the AWS guidance and use the ixgbe drivers instead.

I'm just wondering if anyone's benchmarked the performance - and associated instance-overhead - of the Red Hat bundled ixgbe drivers versus the AWS-recommended ixgbevf drivers.

AWS also offers Elastic Network Adapters (ENA), but that depends on running a 3.2+ kernel. That's fine for RHEL 7, but pretty much all of my customers are on or deploying on RHEL 6.

Responses

That's rather unexpected. Intel PF and VF use different PCI device IDs. This is from EL7:

drivers/net/ethernet/intel/ixgbe/ixgbe_type.h

#define IXGBE_DEV_ID_X550T              0x1563
#define IXGBE_DEV_ID_X550T1             0x15D1
...
/* VF Device IDs */
#define IXGBE_DEV_ID_82599_VF           0x10ED
#define IXGBE_DEV_ID_X540_VF            0x1515
#define IXGBE_DEV_ID_X550_VF            0x1565

I didn't think it would be possible to use the PF driver (ixgbe) on a VF (ixgbevf) or vice versa.

What IDs does lspci -nn | egrep Ether show?

If you can load the PF driver to drive the VF, that doesn't sound like a particularly good idea. There are many code paths in the physical driver which would probably fail on a virtual function.
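
To see which side of the PF/VF split a device falls on, something like the following rough sketch could be used. The function names are made up for illustration; the VF device IDs are the ones quoted from ixgbe_type.h above, plus 15a8 from the modinfo alias list posted later in this thread. It only defines helpers and does not touch the system by itself.

```shell
# Sketch only: ixgbe_role and pci_ids are hypothetical helper names.

# Classify a "vendor:device" pair as printed by lspci -nn (e.g. 8086:10ed).
ixgbe_role() {
    id=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    case "$id" in
        # VF IDs from ixgbe_type.h (10ED, 1515, 1565) and the modinfo aliases (15A8)
        8086:10ed|8086:1515|8086:1565|8086:15a8) echo "VF (ixgbevf)" ;;
        8086:*) echo "Intel, not a listed VF ID" ;;
        *)      echo "not an Intel device" ;;
    esac
}

# Read lspci -nn style text on stdin and print the [vvvv:dddd] pair
# of every Ethernet line. The class code such as [0200] has no colon,
# so it is not matched.
pci_ids() {
    grep -i 'ether' |
        sed -n 's/.*\[\([0-9a-fA-F]\{4\}:[0-9a-fA-F]\{4\}\)].*/\1/p'
}

# Typical use on a live system:
#   for id in $(lspci -nn | pci_ids); do ixgbe_role "$id"; done
```

If this reports a VF ID, then ixgbevf is the driver that should end up bound to the device, whatever else is loaded.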

Why don't the EL6 drivers work? The 82599 is an early 10GbE chipset and the device ID for the VF is in the EL6 driver:

drivers/net/ixgbe/ixgbe_type.h

/* VF Device IDs */
#define IXGBE_DEV_ID_82599_VF           0x10ED

This device ID and the function ixgbe_enable_sriov() have been present since RHEL 6.2.

Can't really answer your "why does AWS want 2.14.2+" - just that they very specifically call that version out in their documentation (obliquely, in the Ubuntu section; in the "other Linuxes" section, the only RHEL they even speak to is RHEL 7). In my experience with other AWS elements that call out specific versions (e.g., CodeCommit's Git and libcurl requirements), attempts to use anything older than the specified version result in either flaky or completely broken functionality.

With respect to ixgbe vs ixgbevf - the primary difference between EL6 AMIs that didn't work with M4-generation instance-types and EL6 AMIs that did was whether ixgbe was explicitly enabled or not. The AWS MarketPlace AMIs had dracut activating both ixgbe and ixgbevf; my forks of those AMIs only had ixgbevf activated. Until I added ixgbe, my forks were not compatible with M4-generation instance-types.
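
For anyone wanting to check the same thing on their own AMI, here is a rough sketch that reports which of the two drivers are present in an initramfs listing. The function name is made up, and the exact dracut settings used in the MarketPlace AMIs are not shown in this thread, so the rebuild command in the comments is an assumption about one plausible way to include both modules.

```shell
# Sketch only: initramfs_ixgbe_drivers is a hypothetical helper name.
# It reads a file listing on stdin (for example the output of lsinitrd)
# and reports whether ixgbe.ko and ixgbevf.ko appear in it.
initramfs_ixgbe_drivers() {
    listing=$(cat)
    for drv in ixgbe ixgbevf; do
        # Match ".../ixgbe.ko" or a compressed variant such as ".../ixgbe.ko.xz"
        if printf '%s\n' "$listing" | grep -Eq "/${drv}\.ko(\.[a-z0-9]+)?\$"; then
            echo "$drv: present"
        else
            echo "$drv: missing"
        fi
    done
}

# Typical use on a live system:
#   lsinitrd /boot/initramfs-$(uname -r).img | initramfs_ixgbe_drivers
#
# Assumed way to rebuild with both drivers included (verify against the
# dracut man page for your release before relying on it):
#   dracut --force --add-drivers "ixgbe ixgbevf" \
#       /boot/initramfs-$(uname -r).img $(uname -r)
```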

That said, some of the outputs returned by subsystem queries are complete when the newer-generation ixgbevf is the loaded driver but incomplete when it is not. That makes digging into things a bit more indirect. Overall, working with AWS-hosted RHEL can be very frustrating from a diagnostic standpoint. Some tools (ethtool for one) don't return (all) the data you might normally expect. Worse, if things do get jacked up, you don't have a recovery-console to work with (best you can do is take the disk from a broken instance, attach it to a diagnostic instance and hope the clue-nuggets were logged to that broken instance's disk).

Ok. Couldn't let it be. While the dracut options were the only differences between getting a 10Gbps instance that worked and one that never got networking (with "eth: error fetching interface information: Device not found" in its boot logs), it looks like the final run-state is using ixgbevf:

# ethtool eth0
Settings for eth0:
        Supported ports: [ ]
        Supported link modes:   10000baseT/Full
        Supported pause frame use: No
        Supports auto-negotiation: No
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Speed: 10000Mb/s
        Duplex: Full
        Port: Other
        PHYAD: 0
        Transceiver: Unknown!
        Auto-negotiation: off
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes
# ethtool -i eth0
driver: ixgbevf
version: 2.12.1-k
firmware-version:
bus-info: 0000:00:03.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no
# modinfo ixgbevf
filename:       /lib/modules/2.6.32-642.4.2.el6.x86_64/kernel/drivers/net/ixgbevf/ixgbevf.ko
version:        2.12.1-k
license:        GPL
description:    Intel(R) 82599 Virtual Function Driver
author:         Intel Corporation, <linux.nics@intel.com>
srcversion:     DE62374598B0A0CCD2C4C86
alias:          pci:v00008086d000015A8sv*sd*bc*sc*i*
alias:          pci:v00008086d00001565sv*sd*bc*sc*i*
alias:          pci:v00008086d00001515sv*sd*bc*sc*i*
alias:          pci:v00008086d000010EDsv*sd*bc*sc*i*
depends:
vermagic:       2.6.32-642.4.2.el6.x86_64 SMP mod_unload modversions
parm:           debug:Debug level (0=none,...,16=all) (int)
# lspci -nn | grep Ether
00:03.0 Ethernet controller [0200]: Intel Corporation 82599 Ethernet Controller Virtual Function [8086:10ed] (rev 01)

So now I've got even more of a freaking mystery (almost as much of a mystery as how, sometimes, when registering an image, launched instances' networking fails, but if I re-register from the same template, launched instances' networking succeeds). Greh.

I assume there's some earlier bug they consider the platform vulnerable to hitting. Here's the changelog on Linus' tree from 2.2.0 to 2.6.0 (there was never a 2.4.x here):

 $ git log --oneline c1a7e1e^..9cd9130 drivers/net/ethernet/intel/ixgbevf
9cd9130 ixgbevf: Update version string
795180d ixgbevf: Make sure jumbo frames are set correctly after PF reset
31a1b37 ixgbevf: Add support to recognize 100mb link speed
b3f4d59 intel: make wired ethernet driver message level consistent (rev2)
f794e7e ixgbevf: print MAC via printk format specifier
1a0d6ae rename dev_hw_addr_random and remove redundant second
dd48dc3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
5c47a2b ixgbevf: Update copyright notices
3a2c403 ixgbevf: Fix mailbox interrupt ack bug
e404dec drivers/net: Remove unnecessary k.alloc/v.alloc OOM messages
3d8fe98 ixgbevf: make operations tables const
b5417bf ixgbevf: fix sparse warnings
b47aca1 ixgbevf: make ethtool ops and strings const
375b27c ixgbevf: Prevent possible race condition by checking for message
f131a6c ixgbevf: Fix register defines to correctly handle complex expressions
8e58613 net: make vlan ndo_vlan_rx_[add/kill]_vid return error value
1f2149c net: remove netdev_alloc_page and use __GFP_COLD
84b4050 Sweep away N/A fw_version dustbunnies from the .get_drvinfo routine of a number of drivers
f85fa27 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
c8f44af net: introduce and use netdev_features_t for device features sets
ea99d83 intel: Convert <FOO>_LENGTH_OF_ADDRESS to ETH_ALEN
dbd9636 ixgbevf: Convert printks to pr_<level>
c1a7e1e ixgbevf: Update release version

At least the above result looks like it gives you a workaround.

Maybe Amazon could explain why they recommend a later ixgbevf? If it's to work around a bug, the chances of getting any patches into RHEL 6 now are probably low, so the solution would be to run the Sourceforge driver or use a different interface type which has a good driver in the EL6 kernel.

I checked ELRepo but their kmod-ixgbevf is 2.16. They may accept a request on their mailing list to update it if you prefer their packaged module over Sourceforge.

Dunno, it's a weird situation. They offer two methods in their documentation: ENA and SR-IOV. The former, they indicate, wants a 3.2+ kernel. The latter, they indicate, wants a 2.14.2+ ixgbevf driver. The ENA method obviously precludes use with RHEL 6. The latter initially appeared to mean that I was stuck with doing the sourceforge thing and leveraging DKMS (though it looks like the dracut mods mean that the RHEL 6 stock drivers will function).
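
For checking a box against that 2.14.2 figure, a minimal sketch along these lines might help. The function names are made up, and it assumes GNU sort with -V is available. Suffixes such as "-k" are dropped before comparing.

```shell
# Sketch only: driver_version and version_ge are hypothetical helper names.

# Read "ethtool -i" (or modinfo) output on stdin and print the numeric part
# of the "version:" field, e.g. "2.12.1-k" becomes "2.12.1".
# The "firmware-version:" line is ignored because of the ^ anchor.
driver_version() {
    sed -n 's/^version:[[:space:]]*\([0-9][0-9.]*\).*/\1/p'
}

# Succeed if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Typical use on a live system:
#   v=$(ethtool -i eth0 | driver_version)
#   if version_ge "$v" 2.14.2; then echo "meets AWS minimum"; else echo "older than 2.14.2"; fi
```

Note that this only compares version strings. It says nothing about whether a distribution kernel has backported the relevant fixes into a driver that still reports an older version.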

At any rate, when I was previously digging through, it looked like Kernel.Org didn't incorporate the updated ixgbevf drivers until kernel 4.8. Amazon Linux is running that kernel (or, at least when I spun one up last month, that's where it was at). Their ixgbevf notes seem to indicate that they've only really tested their own Linux's drivers (presumably 2.14.2+), Ubuntu's (which at the time were 2.11) and very lightly with RHEL7 (since there's no mentions for RHEL6) - which seemed to have been lumped under the hand-waved "other Linuxes".

Like I indicated previously, with some other AWS service integrations where AWS has called out minimum versions, I've encountered non-trivial (to fatal) problems when attempting to use the non-spec versions. So, I'd made the assumption that my previous AMIs' M4-generation incompatibilities were a manifestation of that. I'd done the third-party drivers/DKMS work-around, but that wasn't likely to be a solution that made our accreditors happy. So, when I did the AMI config comparison between the RHUI-using MarketPlace Red Hat AMI and my forked AMI and found that difference, my assumption was that the upshot was using the ixgbe drivers rather than the ixgbevf drivers.

At this point, I can only suppose that something within the boot process's interactions with the AWS virtual hardware is perhaps using the ixgbe in the initramfs to shim its way to being able to use the stock ixgbevfs. It doesn't feel like that should be the case, but not being able to boot-trace in the AWS context leaves me with functional instances that I can't explain why they're working.

I freaking hate unsolved mysteries. :p

I can't offer much except some version information. Upstream has had ena since v4.8 though maybe the driver will run on 3.2 or later. Upstream has had ixgbevf since 2.6.34 though I didn't look into exactly which models were supported when.

I also dislike knowing how to get something working but not really knowing the underlying reason why it works.


Tom, FYI I've started a knowledgebase solution about this, "AWS Enhanced Networking requires ixgbevf driver version 2.14", and I'm trying to find out exactly which upstream patches are required so we can determine if our modules actually can do this despite them being based on 2.12.

The stock drivers appear to work; it's just that if you come at things from the AWS documentation, they indicate that you want the non-stock drivers. That said, in comparing a set of m4.10xlarge instances, there was no meaningful performance difference between stock and the latest Intel drivers (hosted on sourceforge). Obviously, given the pricing on those instance types, I don't have any real-world, long-running workloads to see if there's a stability difference (just iperf runs of various duration and parallelism).
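
For anyone wanting to repeat that kind of comparison, a sweep over duration and parallelism could be sketched roughly like this. The function name, the server address, and the specific durations and stream counts are all placeholders, not the values used in the runs described above. The function only prints the commands so they can be reviewed first.

```shell
# Sketch only: iperf_sweep is a hypothetical helper name, and the
# durations/stream counts below are arbitrary illustrative values.
# Prints one iperf client command per duration/parallelism pair.
iperf_sweep() {
    server=$1
    for secs in 30 300 1800; do        # -t: test length in seconds
        for streams in 1 4 16; do      # -P: number of parallel client streams
            echo "iperf -c $server -t $secs -P $streams"
        done
    done
}

# Typical use, with "iperf -s" already running on the peer instance
# (10.0.0.10 is a placeholder address):
#   iperf_sweep 10.0.0.10          # review the commands
#   iperf_sweep 10.0.0.10 | sh     # run them one after another
```

Running the same sweep once with the stock driver and once with the newer one, on the same instance pair, keeps the comparison reasonably fair.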

One thing I'd recommend: if someone opts to use the SourceForge-hosted drivers, they'll want to configure DKMS to make maintenance a bit less hateful (though IA people will choke on the need to maintain a compiler).
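
As a rough idea of what the DKMS side looks like, here is a sketch that writes a dkms.conf into an unpacked driver source tree. The function name is made up, and it assumes the tarball keeps its Makefile under src/ and that its Makefile accepts BUILD_KERNEL; check both against the tarball you actually download. The version is passed in rather than hard-coded.

```shell
# Sketch only: write_ixgbevf_dkms_conf is a hypothetical helper name.
# Assumes the driver sources were unpacked to $1 (for example
# /usr/src/ixgbevf-<version>) with the Makefile under src/.
write_ixgbevf_dkms_conf() {
    src_dir=$1
    version=$2
    {
        printf 'PACKAGE_NAME="ixgbevf"\n'
        printf 'PACKAGE_VERSION="%s"\n' "$version"
        # Quoted heredoc: ${kernelver} is written literally for DKMS to expand.
        cat <<'EOF'
CLEAN="cd src/; make clean"
MAKE="cd src/; make BUILD_KERNEL=${kernelver}"
BUILT_MODULE_LOCATION[0]="src/"
BUILT_MODULE_NAME[0]="ixgbevf"
DEST_MODULE_LOCATION[0]="/updates"
DEST_MODULE_NAME[0]="ixgbevf"
AUTOINSTALL="yes"
EOF
    } > "$src_dir/dkms.conf"
}

# Typical follow-up on a live system, with <version> matching the tarball:
#   dkms add -m ixgbevf -v <version>
#   dkms build -m ixgbevf -v <version>
#   dkms install -m ixgbevf -v <version>
```

With AUTOINSTALL set, DKMS should rebuild the module when a new kernel is installed, which is the part that makes maintenance less hateful.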

I suspect that EL7 already includes the commit needed, maybe EL6 does as well, or maybe there's some corner-case bug resolved in 2.14 which most customers won't hit and they're just covering themselves against that.

I also just realised I missed a version above. ELRepo has 2.16 which is greater than 2.14 so that's a better option for those unable/unwilling to have a compiler installed.

I have seen customer environments which don't even have gcc/g++ in their Satellite repo, in order to comply with security requirements.

Sometimes I wish that Linux came with a kernel-compiler wholly separate from - and not suitable as - a generic compiler. But that went out of style a long time ago (SGI jettisoned it with IRIX 6.x and the last OS I remember having it was HP-UX - possibly Tru64 - circa 2005).
