On Thu, 28 Feb 2019 at 16:14, Robin Murphy <robin.mur...@arm.com> wrote:
>
> Hi Ard,
>
> On 28/02/2019 14:16, Ard Biesheuvel wrote:
> > (+ Catalin)
> >
> > On Tue, 19 Feb 2019 at 16:08, Ilias Apalodimas
> > <ilias.apalodi...@linaro.org> wrote:
> >>
> >> On Tue, Feb 19, 2019 at 12:08:42AM +0100, Ard Biesheuvel wrote:
> >>> It turns out that the IP checksumming code is still exercised often,
> >>> even though one might expect that modern NICs with checksum offload
> >>> have no use for it. However, as Lingyan points out, there are
> >>> combinations of features where the network stack may still fall back
> >>> to software checksumming, and so it makes sense to provide an
> >>> optimized implementation in software as well.
> >>>
> >>> So provide an implementation of do_csum() in scalar assembler, which,
> >>> unlike C, gives direct access to the carry flag, making the code run
> >>> substantially faster. The routine uses overlapping 64 byte loads for
> >>> all input size > 64 bytes, in order to reduce the number of branches
> >>> and improve performance on cores with deep pipelines.
> >>>
> >>> On Cortex-A57, this implementation is on par with Lingyan's NEON
> >>> implementation, and roughly 7x as fast as the generic C code.
> >>>
> >>> Cc: "huanglingyan (A)" <huanglingy...@huawei.com>
> >>> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
> > ...
> >>
> >> Acked-by: Ilias Apalodimas <ilias.apalodi...@linaro.org>
> >
> > Full patch here
> >
> > https://lore.kernel.org/linux-arm-kernel/20190218230842.11448-1-ard.biesheu...@linaro.org/
> >
> > This was a follow-up to some discussions about Lingyan's NEON code,
> > CC'ed to netdev@ so people could chime in as to whether we need
> > accelerated checksumming code in the first place.
>
Thanks for taking a look.

> FWIW ever since we did ip_fast_csum() I've been meaning to see how well
> I can do with a similar tweaked C implementation for this (mostly for
> fun). Since I've recently dug out my RK3328 box for other reasons I'll
> give this a test - that's a weedy little quad-A53 whose GbE hardware
> checksumming is slightly busted and has to be turned off, so the
> do_csum() overhead under heavy network load is comparatively massive.
> (plus it's non-EFI so I should be able to try big-endian easily too)
>

Yes please. I've been meaning to run this on A72 myself, but ever
since my MacchiatoBin self-combusted, I've been relying on AWS for
this, which is a bit finicky.

As for the C implementation, not having access to the carry flag is
pretty limiting, so I wonder how you intend to get around that.

> The asm looks pretty reasonable to me - instinct says there's *possibly*
> some value for out-of-order cores in doing the 8-way accumulations in a
> more pairwise fashion, but I guess either way the carry flag dependency
> is going to dominate, so it may well be moot.

Yes. In fact, I was surprised the speedup is as dramatic as it is
despite this dependency, but I guess cores optimize for it rather
well at the uarch level.

> What may be more
> worthwhile is taking the effort to align the source pointer, at least
> for larger inputs, so as to be kinder to little cores - according to its
> optimisation guide, A55 is fairly sensitive to unaligned loads, so I'd
> assume that's true of its older/smaller friends too. I'll see what I can
> measure in practice - until proven otherwise I'd have no great objection
> to merging this patch as-is if the need is real. Improvements can always
> come later :)
>

Good point re alignment, I didn't consider that at all tbh.

I'll let the maintainers decide whether/when to merge this. I don't
feel strongly either way.
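
For illustration only: a minimal sketch of one common way to recover the
carry in plain C, by comparing the wrapped sum against an operand and
adding the lost bit back in. This is neither the do_csum() from the patch
nor the C implementation Robin has in mind; csum_sketch() and
add64_carry() are hypothetical names, and the result is the 16-bit ones'
complement sum in the host's native byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 64-bit add with end-around carry, no carry flag. */
static uint64_t add64_carry(uint64_t sum, uint64_t v)
{
	sum += v;
	/* The addition wrapped if and only if the result is below v. */
	return sum + (sum < v);
}

/* Hypothetical sketch: ones' complement sum of len bytes at buf. */
static uint16_t csum_sketch(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0, v;

	while (len >= 8) {
		memcpy(&v, buf, 8);	/* portable unaligned 64-bit load */
		sum = add64_carry(sum, v);
		buf += 8;
		len -= 8;
	}
	if (len) {
		v = 0;
		memcpy(&v, buf, len);	/* tail, zero-padded at the end */
		sum = add64_carry(sum, v);
	}

	/* Fold 64 -> 32 -> 16 bits, adding each carry back in. */
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

Whether a compiler turns the comparison into an add-with-carry sequence
is not guaranteed, which is the limitation mentioned above. The sketch
also makes no attempt to align the source pointer or to overlap loads.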