Why is my NAPI poll function causing high CPU usage?


Issue
The NAPI poll function is causing high CPU usage on a single core (pegged at 100%). This leads me to believe I am not implementing the NAPI poll function correctly, or that there is a bug in the kernel version I am using. I am able to receive packets; however, the moment NAPI is scheduled, CPU usage spikes. I have traced the issue to the poll routine below.

Questions

  1. Is NAPI correctly implemented in the below example?
  2. How does NAPI scheduling avoid high CPU usage on a single kernel thread? For example, if my device quickly consumes its per-device budget, eventually causing NAPI to exhaust its system-wide budget, would my device starve other pollers by continuing to be scheduled on that kernel thread, or would it be forced to yield so others can process? (My current understanding of the poll contract is sketched below.)
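
For reference, here is my current understanding of the contract between net_rx_action() and a driver's poll routine, written as a minimal sketch. struct my_dev, my_clean_rx(), and my_enable_irqs() are hypothetical names used only for illustration; only the napi_struct plumbing is real:

static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_dev *dev = container_of(napi, struct my_dev, napi);
    int work_done;

    // Process at most 'budget' RX packets and report how many were done.
    work_done = my_clean_rx(dev, budget);

    if(work_done < budget)
    {
        // All pending work finished: leave polled mode and re-arm IRQs.
        napi_complete(napi);
        my_enable_irqs(dev);
    }

    // Returning exactly 'budget' keeps this device on the poll list;
    // returning less tells net_rx_action() we are done.
    return work_done;
}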

Information

  • Linux Distro - Red Hat Enterprise Linux 7.6 (Maipo)
  • Kernel - 3.10.0-957
  • We are using a custom piece of hardware (hence the custom driver). It is a PCIe Gen 2 device, x4 width, with a link speed of 5 GT/s.
  • Processor - Intel Core i7-5700EQ @ 2.60 GHz

Further Details
I have been using the top command to monitor CPU usage. When NAPI begins to run, a single ksoftirqd thread pegs one of the cores at 100%. In an effort to debug this, I commented out nic_clean_tx() and nic_clean_rx() (see NAPI Poll Routine below) so the poll routine does NOT process any packets but still lets NAPI keep polling. I would have expected CPU usage to go down, since the device would now clear its budget immediately, but it remained high. Note: our net.core.netdev_budget is 300, and our device's NAPI weight is hardcoded to 64.
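
For context, my reading of the net_rx_action() accounting on this kernel is roughly the following. This is a paraphrased sketch, not the actual kernel source; first_napi_on_list() and move_to_tail() are simplified placeholders:

budget = netdev_budget;                   /* system-wide budget, 300 here */
time_limit = jiffies + 2;
while(!list_empty(&poll_list))
{
    if(budget <= 0 || time_after_eq(jiffies, time_limit))
        break;                            /* re-raise NET_RX_SOFTIRQ and yield */

    n = first_napi_on_list(&poll_list);   /* placeholder helper */
    work = n->poll(n, n->weight);         /* per-device weight, 64 here */
    budget -= work;

    if(work == n->weight)
        move_to_tail(&poll_list, n);      /* round-robin so others get a turn */
}

If that reading is right, a device that always consumes its full weight gets rotated to the back of the poll list rather than starving other pollers, but the softirq keeps being re-raised on the same CPU.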

Code
The code below was originally referenced against the Intel IGB driver to ensure correct usage of NAPI. In addition, I used the (admittedly outdated) NAPI-HOWTO for reference. I have added comments clarifying what each helper function does, to avoid posting more code than necessary.

Initialize NAPI Context

netif_napi_add(adapter->netdev, &adapter->napi, nic_napi_poll, 64);

MSI Interrupt Handler

static irqreturn_t nic_intr_msi(int irq, void *data)
{
    struct Adapter *adapter = data;

    if(napi_schedule_prep(&adapter->napi))
    {
        disable_irq_nosync(adapter->irqs[TX_IRQ]);
        disable_irq_nosync(adapter->irqs[RX_IRQ]);

        disable_adapter_msi(adapter);

        __napi_schedule(&adapter->napi);

        enable_irq(adapter->irqs[TX_IRQ]);
        enable_irq(adapter->irqs[RX_IRQ]);
    }

    return IRQ_HANDLED;
}
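
For comparison, the IGB-style pattern I referenced masks interrupt generation at the device in the handler and only re-enables it from the poll routine after napi_complete(). A minimal sketch of that shape; mask_device_irqs() is a hypothetical helper name, not something from my driver or the kernel:

static irqreturn_t my_intr(int irq, void *data)
{
    struct Adapter *adapter = data;

    if(napi_schedule_prep(&adapter->napi))
    {
        // Stop the device from raising further interrupts...
        mask_device_irqs(adapter);

        // ...and hand the work to softirq context.
        __napi_schedule(&adapter->napi);
    }

    // Note: nothing is re-enabled here; interrupts stay masked until the
    // poll routine calls napi_complete() and unmasks them.
    return IRQ_HANDLED;
}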

NAPI Poll Routine

static int nic_napi_poll(struct napi_struct *napi, int weight)
{
    struct Adapter *adapter = NULL;
    bool clean_complete = false;
    int budget = weight;

    // Get the private data for the adapter.
    adapter = container_of(napi, struct Adapter, napi);

    // Update buffer descriptors for any packets sent.
    nic_clean_tx(adapter);

    // Receive packets and update buffer descriptors, up to the given NAPI
    // budget.
    clean_complete = nic_clean_rx(adapter->netdev, weight);

    if(!clean_complete)
    {
        // Not clean yet, but we have to give other adapters a chance to
        // process.
        goto not_clean;
    }

    // Not enough RX work has been done to justify polling any longer; go back
    // to interrupts.
    napi_complete(napi);
    budget = 0;
    adapter_enable_msi(adapter);

not_clean:

    return budget;
}
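
One thing I am unsure about is returning a bool from nic_clean_rx() and collapsing it to 0-or-budget here. The IGB-style drivers I referenced return the actual number of packets processed instead. Below is a sketch of that variant, assuming a hypothetical nic_clean_rx_count() that returns the count of packets cleaned (0..weight):

static int nic_napi_poll_counted(struct napi_struct *napi, int weight)
{
    struct Adapter *adapter = container_of(napi, struct Adapter, napi);
    int work_done;

    nic_clean_tx(adapter);

    // Hypothetical signature change: returns packets cleaned, 0..weight.
    work_done = nic_clean_rx_count(adapter->netdev, weight);

    if(work_done < weight)
    {
        // All pending RX handled within budget: leave polled mode and
        // re-arm device interrupts.
        napi_complete(napi);
        adapter_enable_msi(adapter);
    }

    // Returning less than weight tells net_rx_action() we are done;
    // returning exactly weight keeps us on the poll list.
    return work_done;
}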

Cleaning RX (Receiving Packets and Updating Buffer Descriptors)

static bool nic_clean_rx(struct net_device* netdev, const int budget)
{
    int cleaned_count = 0;
    struct Adapter *adapter = netdev_priv(netdev);
    struct buffer_desc_ring *ring = NULL;
    struct buffer_desc *desc = NULL;
    struct sk_buff *skb = NULL;
    uint16_t len = 0;
    int net_rx_status = NET_RX_SUCCESS;

    do {
        // Our device has multiple buffer rings, similar to multi-queue but
        // hardware algorithms are used to place packets in the appropriate queue.
        // This function will find me the correct ring, with packets ready to be
        // received.
        ring = get_high_rx_q(adapter, &desc, MLS_NIC_DIFF_SERV_EF_QUEUE);
        if(!ring)
        {
            break;
        }

        // This will get the buffer descriptor in the associated ring
        // that is currently to be processed. An index called "clean_index" 
        // is used to identify this hence "get_clean_desc()".
        desc = get_clean_desc(ring);

        /* Allocate an SKB large enough for the packet we received. */
        len = get_desc_len(desc);
        skb = netdev_alloc_skb(ring->netdev, len);
        if(unlikely(!skb))
        {
            atomic64_inc(&(ring->stats.alloc_failed));
            break;
        }

        // get_clean_buffer is similar to get_clean_desc, but instead of
        // getting a buffer descriptor it will get the buffer that descriptor
        // is referencing.
        memcpy(skb_put(skb, len), get_clean_buffer(ring), len);
        skb->protocol = eth_type_trans(skb, ring->netdev);
        skb->dev = ring->netdev;
        skb->ip_summed = CHECKSUM_NONE;

        // Post SKB to the stack.
        net_rx_status = netif_receive_skb(skb);
        if(net_rx_status != NET_RX_SUCCESS)
        {
            atomic64_inc(&(ring->stats.drops));
        }

        /* This should never happen, as the driver is the only one setting
         * the descriptor back to empty. Check anyway.
         */
        if(unlikely(!clean_rx_ring_desc(ring, desc)))
        {
            // TODO: figure out what to do in this error condition.
            printk(KERN_ERR "%s: Unable to clean RX descriptor\n", DRIVER_NAME);
            break;
        }

        inc_clean_desc(ring);
        cleaned_count++;

    } while(likely(cleaned_count < budget));

    return (cleaned_count < budget);
}
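
I also considered handing packets to the stack with napi_gro_receive() instead of netif_receive_skb(), since we are already in NAPI poll context. A sketch of that substitution as a helper; nic_rx_to_stack() is a hypothetical name, and the drop accounting mirrors the stats handling above:

static void nic_rx_to_stack(struct Adapter *adapter,
                            struct buffer_desc_ring *ring,
                            struct sk_buff *skb)
{
    // napi_gro_receive() takes ownership of the skb either way; only the
    // result code tells us whether the stack dropped it. The skb must
    // already have had eth_type_trans() applied, as above.
    if(napi_gro_receive(&adapter->napi, skb) == GRO_DROP)
    {
        atomic64_inc(&(ring->stats.drops));
    }
}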

Responses