Another issue I found during my tests over the last few days is a problem with BQL, and more generally with any case where the driver stops and restarts the queue.
When we are under stress and BQL stops the queue, the driver TX completion does a lot of work, and the CPU servicing it also takes over the further qdisc_run() calls.

The work-flow is:

1) Collect up to 64 packets (256 for mlx4) from the TX ring, unmap things, and queue the skbs for freeing.

2) Call netdev_tx_completed_queue(ring->tx_queue, packets, bytes);

       if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
               netif_schedule_queue(dev_queue);

This leaves only a very tiny window in which other cpus could grab __QDISC_STATE_SCHED (in practice they have absolutely no chance to grab it).

So we end up with one cpu doing the ndo_start_xmit() calls, the TX completions, and the RX work.

This problem is magnified when XPS is used and a single-threaded application deals with thousands of TCP sockets.

We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME), or some other way to allow another cpu to service the qdisc and spare us.
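
To make the handoff idea a bit more concrete, here is a user-space toy model using C11 atomics. It is not kernel code and not a proposed patch. All the model_* names and the MODEL_* bits are hypothetical: MODEL_SCHED merely stands in for __QDISC_STATE_SCHED, and MODEL_PLEASE_GRAB_ME for the extra bit suggested above. The assumption is that a cpu which fails to grab the qdisc sets the request bit, and that the current owner checks it after each batch and lets go.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state bits, for this model only. */
#define MODEL_SCHED          (1u << 0) /* stands in for __QDISC_STATE_SCHED */
#define MODEL_PLEASE_GRAB_ME (1u << 1) /* hypothetical handoff request bit */

struct model_qdisc {
	atomic_uint state;
	int owner_cpu; /* -1 when nobody services the qdisc; only
			* read/written sequentially in this model */
};

static void model_init(struct model_qdisc *q)
{
	atomic_init(&q->state, 0u);
	q->owner_cpu = -1;
}

/* Try to become the cpu servicing the qdisc.
 * Succeeds only if MODEL_SCHED was previously clear. */
static bool model_try_grab(struct model_qdisc *q, int cpu)
{
	unsigned int old = atomic_fetch_or(&q->state, MODEL_SCHED);

	if (old & MODEL_SCHED)
		return false;
	q->owner_cpu = cpu;
	return true;
}

/* A cpu that lost the race asks the current owner to let go. */
static void model_request_handoff(struct model_qdisc *q)
{
	atomic_fetch_or(&q->state, MODEL_PLEASE_GRAB_ME);
}

/* Called by the owner after a batch of work. If another cpu asked
 * for the qdisc, consume the request and release MODEL_SCHED.
 * Returns true when ownership was given up. */
static bool model_owner_yield_if_asked(struct model_qdisc *q)
{
	unsigned int old;

	/* Only the current owner is supposed to call this. */
	assert(atomic_load(&q->state) & MODEL_SCHED);

	old = atomic_fetch_and(&q->state, ~MODEL_PLEASE_GRAB_ME);
	if (!(old & MODEL_PLEASE_GRAB_ME))
		return false;

	q->owner_cpu = -1;
	atomic_fetch_and(&q->state, ~MODEL_SCHED);
	return true;
}
```

In this model, cpu 0 grabs the qdisc, cpu 1 fails and calls model_request_handoff(), and the next model_owner_yield_if_asked() on cpu 0 releases it so that cpu 1 can succeed. Without the request bit, the owner never has a reason to release and the other cpu keeps losing, which is the behaviour described above.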