On Thu, 2016-07-07 at 21:16 -0700, Alexei Starovoitov wrote:
> I've tried this style of prefetching in the past for normal stack
> and it didn't help at all.
This is very nice, but my experience showed the opposite numbers, so I guess you did not choose the proper prefetch strategy.

Prefetching in mlx4 gave me good results, once I made sure our compiler was not moving the actual prefetch operations on x86_64 (i.e. forcing use of asm volatile, as on x86_32, instead of the builtin prefetch). You might check that your compiler does the proper thing, because this really hurt me in the past.

In my case, I was using a 40Gbit NIC, and prefetching 128 bytes instead of 64 bytes allowed me to remove one stall in the GRO engine when using TCP with timestamps (total header size: 66 bytes), or tunnels.

The problem with prefetch is that it works well only assuming a given rate (in pps) and given CPUs, since prefetch behavior varies among CPU flavors. Brenden chose to prefetch N+3 based on some experiments on some hardware; prefetch N+3 can actually slow things down under a moderate load, which is the case 99% of the time in typical workloads on modern servers with multi-queue NICs.

This is why it was hard to upstream such changes: they focus on max throughput instead of low latencies.