On Tue, 14 Jul 2020 11:21:45 +0000 Claudiu Manoil wrote:
> >Does it really make sense to implement DIM for TX?
> >
> >For TX the only thing we care about is that no queue in the system
> >underflows. So the calculation is simply timeout = queue len / speed.
> >The only problem is which queue in the system is the smallest (TX
> >ring, TSQ etc.) but IMHO there's little point in the extra work to
> >calculate the thresholds dynamically. On real life workloads the
> >scheduler overhead the async work structs introduce causes measurable
> >regressions.
> >
> >That's just to share my experience, up to you to decide if you want
> >to keep the TX-side DIM or not :)  
> 
> Yeah, I'm not happy either with Tx DIM, it seems too much for this device,
> too much overhead.
> But it seemed there's no other option left, because leaving coalescing as
> disabled for Tx is not an option as there are too many Tx interrupts, but
> on the other hand coming up with a single Tx coalescing time threshold to
> cover all the possible cases is not feasible either.  However your suggestion
> to compute the Tx coalescing values based on link speed, at least that's how
> I read it, is worth investigating.  This device is supposed to handle link
> speeds ranging from 10Mbit to 2.5G, so it would be great if TX DIM could be
> replaced in this case by a set of precomputed values based on link speed.
> I'm going to look into this.  If you have any other suggestion on this pls
> let me know.

If you were happy with TX DIM - my guess would be that even if you
leave the TX coalescing with the value optimal for 2.5G - it will be
perfectly fine for other speeds, too. TX DIM is quite aggressive, if
I'm reading the code correctly it maxes out at 64us - which is a low
value for TX.

In my experiments with 25G NICs and TCP workloads (and some synthetic
netperf TCP_RR) the optimal value seems to be TSQ / link speed (- some
safety margin). Which is ~360us for 25G, since the TSQ value was bumped
to 1MB in recent kernels.
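
A rough sketch of the "TSQ / link speed" arithmetic, in Python for
convenience. The function name is made up for illustration, "1MB" is
read as 1 MiB (an assumption), and no safety margin or framing overhead
is applied. At 25G this gives about 335us, the same few-hundred-us range
as the figure above; the exact number depends on how the limit and the
overhead are counted.

```python
def drain_time_us(queue_bytes: int, link_bps: float) -> float:
    """Microseconds the link needs to transmit queue_bytes of data."""
    return queue_bytes * 8 / link_bps * 1e6


# Assumption: the "1MB" TSQ value is taken as 1 MiB.
TSQ_BYTES = 1 << 20

if __name__ == "__main__":
    # Upper bound on the TX coalescing timeout per link speed,
    # before subtracting any safety margin.
    for gbps in (2.5, 10, 25):
        t = drain_time_us(TSQ_BYTES, gbps * 1e9)
        print(f"{gbps:>4} Gbit/s: {t:8.1f} us")
```

The slower the link, the longer the same queue takes to drain, which is
why a timeout chosen for the fastest speed should not underflow the
queue at lower speeds.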

Obviously YMMV if the system is running a routing or raw socket app.
Then you presumably want to sustain max throughput on 2.5G with min
sized frames. And your rings by default hold 256 entries - that's still
~50us to complete a ring.
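
The ring estimate can be checked the same way. This is only a sketch:
the helper name is hypothetical, the 64-byte minimum frame and the 20
bytes of preamble plus inter-frame gap are standard Ethernet figures,
and the 256-entry ring is the default mentioned above.

```python
def ring_drain_time_us(entries: int, frame_bytes: int, link_bps: float) -> float:
    """Microseconds to transmit a full ring of equally sized frames."""
    return entries * frame_bytes * 8 / link_bps * 1e6


RING_ENTRIES = 256        # default ring size mentioned in the text
MIN_FRAME_BYTES = 64      # minimum Ethernet frame, including FCS
PREAMBLE_IFG_BYTES = 20   # 8 bytes preamble/SFD + 12 bytes inter-frame gap
LINK_BPS = 2.5e9          # 2.5G link

if __name__ == "__main__":
    bare = ring_drain_time_us(RING_ENTRIES, MIN_FRAME_BYTES, LINK_BPS)
    on_wire = ring_drain_time_us(
        RING_ENTRIES, MIN_FRAME_BYTES + PREAMBLE_IFG_BYTES, LINK_BPS
    )
    print(f"frame bytes only:       {bare:.1f} us")
    print(f"with preamble and IFG:  {on_wire:.1f} us")
```

Counting only frame bytes gives roughly 52us, in line with the ~50us
above; including preamble and inter-frame gap pushes it closer to 69us,
so the frame-only figure is the conservative one to size a timeout
against.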
