On Thu, 12 Jul 2018 23:10:28 +0300 Or Gerlitz <gerlitz...@gmail.com> wrote:
> On Wed, Jul 11, 2018 at 11:06 PM, Jesper Dangaard Brouer
> <bro...@redhat.com> wrote:
> > Well, I would prefer you to implement those. I just did a quick
> > implementation (it's trivially easy) so I have something to benchmark
> > with. The performance boost is quite impressive!
>
> sounds good, but wait
>
> > One reason I didn't "just" send a patch is that Edward so far only
> > implemented netif_receive_skb_list() and not napi_gro_receive_list().
>
> sfc doesn't support GRO?! doesn't make sense.. Edward?
>
> > And your driver uses napi_gro_receive(). This sort-of disables GRO for
> > your driver, which is not a choice I can make. Interestingly, I get
> > around the same netperf TCP_STREAM performance.
>
> Same TCP performance

I said *around* the same... I'll redo the benchmarks and verify...
(did it.. see later).

> with GRO and no rx-batching
>
> or
>
> without GRO and yes rx-batching

Yes, obviously without GRO and yes rx-batching.

> is by far not an intuitive result to me, unless both these techniques
> mostly serve to eliminate lots of instruction cache misses, and the
> TCP stack is so optimized that, if the code is in the cache, going
> through it once with a 64K-byte GRO-ed packet is like going through it
> ~40 (64K/1500) times with non-GRO-ed packets.

Actually, the GRO code path is rather expensive, and it uses a lot of
indirect calls. If you have a UDP workload, then disabling GRO will
give you a 10-15% performance boost. Edward's changes are basically a
generalized version of GRO, up to the IP layer (ip_rcv). So, for me it
makes perfect sense.

> What's the baseline (with GRO and no rx-batching) number on your setup?

Okay, redoing the benchmarks... I implemented a code hack so I can
control at runtime whether the mlx5 driver uses napi_gro_receive() or
netif_receive_skb_list() (abusing a netdev ethtool-controlled feature
flag that is not in use). A rough sketch of the hack is appended below
my signature.

To get a quick test going, with feedback every 3 sec, I use:

 $ netperf -t TCP_STREAM -H 198.18.1.1 -D3 -l 60000 -T 4,4

Default: using napi_gro_receive() with GRO enabled:
 Interim result: 25995.28 10^6bits/s over 3.000 seconds

Disable GRO, but still use napi_gro_receive():
 Interim result: 21980.45 10^6bits/s over 3.001 seconds

Make driver use netif_receive_skb_list():
 Interim result: 25490.67 10^6bits/s over 3.002 seconds

As you can see, using netif_receive_skb_list() gives a huge performance
boost over disabled GRO, and it comes very close to the performance of
enabled GRO. Which is rather impressive! :-)

Notice, even more impressively: these tests are without
CONFIG_RETPOLINE. We primarily merged netif_receive_skb_list() due to
the overhead of RETPOLINEs, but we see a benefit even when not using
RETPOLINEs.

> > I assume we can get even better perf if we "listify" napi_gro_receive.
>
> yeah, that would be very interesting to get there

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
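
P.S. The sketch referenced above, for the curious. This is NOT the
actual mlx5 patch, just a minimal illustration of the idea:
mydrv_napi_poll(), mydrv_next_rx_skb() and rx_use_skb_list are made-up
placeholders (the real hack reads an ethtool feature flag, and the
real poll loop is driver-specific); only napi_gro_receive(),
netif_receive_skb_list(), skb->list and napi_complete_done() are the
real kernel APIs in play.

 #include <linux/list.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>

 /* Stand-ins for the ethtool feature flag and the driver's RX fetch */
 static bool rx_use_skb_list;
 static struct sk_buff *mydrv_next_rx_skb(struct napi_struct *napi);

 static int mydrv_napi_poll(struct napi_struct *napi, int budget)
 {
	struct sk_buff *skb;
	LIST_HEAD(rx_list);	/* batch for netif_receive_skb_list() */
	int work_done = 0;

	while (work_done < budget && (skb = mydrv_next_rx_skb(napi))) {
		work_done++;

		if (rx_use_skb_list)
			/* rx-batching: queue skb, deliver all of them below */
			list_add_tail(&skb->list, &rx_list);
		else
			/* default path: per-packet GRO delivery */
			napi_gro_receive(napi, skb);
	}

	/* Hand the whole batch to the stack in one call */
	if (!list_empty(&rx_list))
		netif_receive_skb_list(&rx_list);

	if (work_done < budget)
		napi_complete_done(napi, work_done);

	return work_done;
 }

The point of the flag is just to flip between the two delivery paths
without recompiling between netperf runs; the second benchmark row is
then simply "ethtool -K <dev> gro off" on the unmodified path.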