From: Jesper Dangaard Brouer
> Sent: 17 November 2016 14:58
> On Thu, 17 Nov 2016 06:17:38 -0800
> Eric Dumazet <[email protected]> wrote:
>
> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> >
> > > I can see that qdisc layer does not activate xmit_more in this case.
> > >
> >
> > Sure. Not enough pressure from the sender(s).
> >
> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > limit is kept at a small value.
> >
> > (BTW not all NICs have expensive doorbells)
>
> I believe this NIC mlx5 (50G edition) does.
>
> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> 2414 cycles(tsc) 603.86 ns
>
> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
>
> Samples: 56K of event 'cycles', Event count (approx.): 51613832267
> Overhead   Command    Shared Object     Symbol
> +   8.92%  udp_flood  [kernel.vmlinux]  [k] _raw_spin_lock
>    - _raw_spin_lock
>       + 90.78% __dev_queue_xmit
>       +  7.83% dev_queue_xmit
>       +  1.30% ___slab_alloc
> +   5.59%  udp_flood  [kernel.vmlinux]  [k] skb_set_owner_w
> +   4.77%  udp_flood  [mlx5_core]       [k] mlx5e_sq_xmit
> +   4.09%  udp_flood  [kernel.vmlinux]  [k] fib_table_lookup
> +   4.00%  swapper    [mlx5_core]       [k] mlx5e_poll_tx_cq
> +   3.11%  udp_flood  [kernel.vmlinux]  [k] __ip_route_output_key_hash
> +   2.49%  swapper    [kernel.vmlinux]  [k] __slab_free
>
> In this setup the spinlock in __dev_queue_xmit should be uncontended.
> An uncontended spin_lock+unlock costs 32 cycles(tsc) 8.198 ns on this system.
>
> But 8.92% of the time is spent on it, which corresponds to a cost of 215
> cycles (2414*0.0892). This cost is too high, thus something else is
> going on... I claim this mysterious extra cost is the tailptr/doorbell.

Try adding code to ring the doorbell twice.
If this doesn't slow things down then the doorbell is unlikely to be
responsible for the delay you are seeing.

	David
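
For reference, the per-packet arithmetic in the quoted message can be
checked with a short script. This is only a sketch: the constants are
copied from the message, the function and variable names are made up
for illustration, and the "excess" figure assumes the 32-cycle
uncontended lock cost is the only legitimate part of the time that perf
attributes to _raw_spin_lock.

```python
# Figures copied from the quoted message.
PPS = 1656017.55            # measured UDP TX rate, packets per second
CYCLES_PER_PACKET = 2414    # measured TSC cycles per packet
SPINLOCK_SHARE = 0.0892     # perf overhead attributed to _raw_spin_lock
UNCONTENDED_LOCK_CYCLES = 32  # measured spin_lock+unlock cost


def per_packet_ns(pps):
    """Time budget per packet in nanoseconds at a given packet rate."""
    return 1e9 / pps


def attributed_cycles(cycles_per_packet, share):
    """Cycles per packet that perf attributes to one symbol."""
    return cycles_per_packet * share


ns_per_packet = per_packet_ns(PPS)
lock_cycles = attributed_cycles(CYCLES_PER_PACKET, SPINLOCK_SHARE)
# Hypothetical "excess": whatever is charged to the lock beyond the
# uncontended cost, i.e. the part suspected to be the doorbell.
excess_cycles = lock_cycles - UNCONTENDED_LOCK_CYCLES

print(f"per packet:        {ns_per_packet:.2f} ns")
print(f"charged to lock:   {lock_cycles:.0f} cycles")
print(f"unexplained extra: {excess_cycles:.0f} cycles")
```

If ringing the doorbell a second time adds roughly the "unexplained
extra" amount again, that would support the doorbell theory; if the
numbers barely move, it would not.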