Another issue I found during my tests last days, is a problem with BQL,
and more generally when driver stops/starts the queue.

When under stress and BQL stops the queue, driver TX completion does a
lot of work, and servicing CPU also takes over further qdisc_run().

The work-flow is :

1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
unmap things, queue skbs for freeing.

2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes); 

if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
     netif_schedule_queue(dev_queue);

This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
(They absolutely have no chance to grab it)

So we end up with one cpu doing the ndo_start_xmit() and TX completions,
and RX work.

This problem is magnified when XPS is used, if one mono-threaded application 
deals with
thousands of TCP sockets.

We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
to allow another cpu to service the qdisc and spare us.



Reply via email to