On Tue, 14 Jul 2020 11:21:45 +0000 Claudiu Manoil wrote:
> >Does it really make sense to implement DIM for TX?
> >
> >For TX the only thing we care about is that no queue in the system
> >underflows. So the calculation is simply timeout = queue len / speed.
> >The only problem is which queue in the system is the smallest (TX
> >ring, TSQ etc.) but IMHO there's little point in the extra work to
> >calculate the thresholds dynamically. On real life workloads the
> >scheduler overhead the async work structs introduce cause measurable
> >regressions.
> >
> >That's just to share my experience, up to you to decide if you want
> >to keep the TX-side DIM or not :)
>
> Yeah, I'm not happy either with Tx DIM, it seems too much for this device,
> too much overhead.
> But it seemed there's no other option left, because leaving coalescing as
> disabled for Tx is not an option as there are too many Tx interrupts, but
> on the other hand coming up with a single Tx coalescing time threshold to
> cover all the possible cases is not feasible either. However your suggestion
> to compute the Tx coalescing values based on link speed, at least that's how
> I read it, is worth investigating. This device is supposed to handle link
> speeds ranging from 10Mbit to 2.5G, so it would be great if TX DIM could be
> replaced in this case by a set of precomputed values based on link speed.
> I'm going to look into this. If you have any other suggestion on this pls
> let me know.

If you were happy with TX DIM - my guess would be that even if you
leave the TX coalescing at the value optimal for 2.5G - it will be
perfectly fine for other speeds, too. TX DIM is quite aggressive: if
I'm reading the code correctly, it maxes out at 64us - which is a low
value for TX.

In my experiments with 25G NICs and TCP workloads (and some synthetic
netperf TCP_RR) the optimal value seems to be TSQ / link speed (minus
some safety margin). Which is ~360us for 25G, since the TSQ value was
bumped to 1MB in recent kernels.

Obviously YMMV if the system is running a routing or raw socket app.
Then you presumably want to sustain max throughput on 2.5G with
min-sized frames. And your rings by default hold 256 entries - that's
still ~50us to complete a ring.
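
To make the timeout = queue len / speed arithmetic above concrete, here
is a rough back-of-the-envelope sketch. It is not taken from any driver:
the helper name and constants are made up for illustration, it assumes a
1MB TSQ limit means 2^20 bytes, and it assumes 64-byte minimum frames
with preamble and inter-frame gap ignored. With those assumptions the raw
TSQ figure lands in the low-to-mid 300us range at 25G, and a 256-entry
ring of min-sized frames at roughly 50us on 2.5G.

```python
# Hypothetical helper for illustration only; not from any kernel driver.
def drain_time_us(queue_bytes: int, link_mbps: int) -> float:
    """Microseconds the link needs to drain queue_bytes at link_mbps."""
    # bits / (Mbit/s) comes out directly in microseconds
    return queue_bytes * 8 / link_mbps

TSQ_BYTES = 1 << 20       # assumption: "1MB" TSQ limit taken as 2^20 bytes
RING_ENTRIES = 256        # default ring size mentioned above
MIN_FRAME_BYTES = 64      # assumption: min Ethernet frame, no preamble/IFG

# TCP case: how long 25G takes to drain one TSQ worth of data
tsq_25g_us = drain_time_us(TSQ_BYTES, 25_000)

# Routing/raw socket case: a full ring of min-sized frames at 2.5G
ring_2g5_us = drain_time_us(RING_ENTRIES * MIN_FRAME_BYTES, 2_500)

# Precomputed ring drain times per link speed (Mbit/s), before any margin
ring_table_us = {
    mbps: drain_time_us(RING_ENTRIES * MIN_FRAME_BYTES, mbps)
    for mbps in (10, 100, 1_000, 2_500)
}

if __name__ == "__main__":
    print(f"TSQ @ 25G:   {tsq_25g_us:.1f} us")
    print(f"ring @ 2.5G: {ring_2g5_us:.1f} us")
    for mbps, us in ring_table_us.items():
        print(f"ring @ {mbps} Mbit/s: {us:.1f} us")
```

The 2.5G entry is the smallest in the table, which is why a value tuned
for 2.5G should not underflow the ring at the lower speeds. A real
threshold would sit below these numbers by whatever safety margin you
choose.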