On Mon, 2016-11-21 at 17:03 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 17 Nov 2016 10:51:23 -0800
> Eric Dumazet <eric.duma...@gmail.com> wrote:
>
> > On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
> >
> > > The point is I can see a socket Send-Q forming, thus we do know the
> > > application have something to send. Thus, and possibility for
> > > non-opportunistic bulking. Allowing/implementing bulk enqueue from
> > > socket layer into qdisc layer, should be fairly simple (and rest of
> > > xmit_more is already in place).
> >
> >
> > As I said, you are fooled by TX completions.
>
> Obviously TX completions play a role yes, and I bet I can adjust the
> TX completion to cause xmit_more to happen, at the expense of
> introducing added latency.
>
> The point is the "bloated" spinlock in __dev_queue_xmit is still caused
> by the MMIO tailptr/doorbell. The added cost occurs when enqueueing
> packets, and result in the inability to get enough packets into the
> qdisc for xmit_more going (on my system). I argue that a bulk enqueue
> API would allow us to get past the hurtle of transitioning into
> xmit_more mode more easily.
>

This is very nice, but we already have bulk enqueue; it is called
xmit_more.

The kernel does not know that your application will send another packet
after the one it is sending now.

xmit_more is not often used when applications/stacks send many small
packets: the qdisc is empty (one enqueued packet is immediately dequeued,
so skb->xmit_more is 0), and may even be bypassed (TCQ_F_CAN_BYPASS).

Not sure if this has been tried before, but the doorbell avoidance could
be done by the driver itself, because it knows a TX completion will come
shortly (well... if softirqs are not delayed too much!)

Doorbell would be forced only if:

  ("skb->xmit_more is not set" AND "TX engine is not 'started yet'")
  OR
  (too many [1] packets were put in the TX ring buffer, no point
   deferring more)

Start the pump, but once it is started, let the doorbells be done by
TX completion.

ndo_start_xmit and the TX completion handler would have to maintain a
shared state describing whether packets were ready but the doorbell was
deferred.

Note that TX completion means "if at least one packet was drained";
otherwise busy polling, constantly calling napi->poll(), would force a
doorbell too soon for devices sharing a NAPI for both RX and TX.

But then, maybe busy poll would like to force a doorbell...

I could try these ideas on mlx4 shortly.

[1] The limit could be derived from active "ethtool -c" params, e.g.
    tx-frames
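
To make the doorbell rule above concrete, here is a rough userspace model
of it. This is only a sketch, not driver code: struct txq_model and the
model_*() helpers are hypothetical names, "TX engine started" is assumed
to mean "at least one descriptor is in flight", and "too many packets" is
assumed to mean that the count of deferred descriptors reached a cap.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one TX queue; these names do not come from mlx4. */
struct txq_model {
	unsigned int in_flight;    /* descriptors the NIC was told about, not completed yet */
	unsigned int deferred;     /* descriptors in the ring whose doorbell was deferred */
	unsigned int max_deferred; /* cap, e.g. derived from "ethtool -c" tx-frames */
	unsigned int doorbells;    /* number of (expensive) MMIO doorbell writes */
};

/* Stand-in for the MMIO write: hands all deferred descriptors to the NIC. */
static void model_ring_doorbell(struct txq_model *q)
{
	q->in_flight += q->deferred;
	q->deferred = 0;
	q->doorbells++;
}

/*
 * Transmit path (what ndo_start_xmit would do after filling a descriptor).
 * Assumption: the TX engine counts as "started" when something is in flight,
 * so a TX completion is expected shortly.
 * Returns true if the doorbell was rung.
 */
static bool model_start_xmit(struct txq_model *q, bool xmit_more)
{
	bool engine_started = q->in_flight > 0;

	q->deferred++;
	if ((!xmit_more && !engine_started) || q->deferred >= q->max_deferred) {
		model_ring_doorbell(q);
		return true;
	}
	return false; /* left for the TX completion handler */
}

/*
 * TX completion path. Only a completion that drained at least one packet
 * may flush deferred work, so a busy-polling napi->poll() that finds
 * nothing does not ring the doorbell early.
 * Returns true if the doorbell was rung.
 */
static bool model_tx_completion(struct txq_model *q, unsigned int completed)
{
	if (completed == 0)
		return false;

	assert(completed <= q->in_flight);
	q->in_flight -= completed;

	if (q->deferred > 0) {
		model_ring_doorbell(q);
		return true;
	}
	return false;
}
```

In this model the first packet on an idle queue rings the doorbell
immediately, later packets are batched until a completion drains
something or the cap is hit. A real driver would also have to protect
the shared state between the xmit and completion paths, which the model
ignores.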