Re: [PATCH] arm64: do_csum: implement accelerated scalar version

Ard Biesheuvel Thu, 28 Feb 2019 07:29:09 -0800

On Thu, 28 Feb 2019 at 16:14, Robin Murphy <[email protected]> wrote:
>
> Hi Ard,
>
> On 28/02/2019 14:16, Ard Biesheuvel wrote:
> > (+ Catalin)
> >
> > On Tue, 19 Feb 2019 at 16:08, Ilias Apalodimas
> > <[email protected]> wrote:
> >>
> >> On Tue, Feb 19, 2019 at 12:08:42AM +0100, Ard Biesheuvel wrote:
> >>> It turns out that the IP checksumming code is still exercised often,
> >>> even though one might expect that modern NICs with checksum offload
> >>> have no use for it. However, as Lingyan points out, there are
> >>> combinations of features where the network stack may still fall back
> >>> to software checksumming, and so it makes sense to provide an
> >>> optimized implementation in software as well.
> >>>
> >>> So provide an implementation of do_csum() in scalar assembler, which,
> >>> unlike C, gives direct access to the carry flag, making the code run
> >>> substantially faster. The routine uses overlapping 64 byte loads for
> >>> all input sizes > 64 bytes, in order to reduce the number of branches
> >>> and improve performance on cores with deep pipelines.
> >>>
> >>> On Cortex-A57, this implementation is on par with Lingyan's NEON
> >>> implementation, and roughly 7x as fast as the generic C code.
> >>>
> >>> Cc: "huanglingyan (A)" <[email protected]>
> >>> Signed-off-by: Ard Biesheuvel <[email protected]>
> > ...
> >>
> >> Acked-by: Ilias Apalodimas <[email protected]>
> >
> > Full patch here
> >
> > https://lore.kernel.org/linux-arm-kernel/[email protected]/
> >
> > This was a follow-up to some discussions about Lingyan's NEON code,
> > CC'ed to netdev@ so people could chime in as to whether we need
> > accelerated checksumming code in the first place.


Thanks for taking a look.

> FWIW ever since we did ip_fast_csum() I've been meaning to see how well
> I can do with a similar tweaked C implementation for this (mostly for
> fun). Since I've recently dug out my RK3328 box for other reasons I'll
> give this a test - that's a weedy little quad-A53 whose GbE hardware
> checksumming is slightly busted and has to be turned off, so the
> do_csum() overhead under heavy network load is comparatively massive.
> (plus it's non-EFI so I should be able to try big-endian easily too)
>

Yes please. I've been meaning to run this on A72 myself, but ever
since my MacchiatoBin self-combusted, I've been relying on AWS for
this, which is a bit finicky.

As for the C implementation, not having access to the carry flag is
pretty limiting, so I wonder how you intend to get around that.
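
To illustrate the limitation (a rough sketch only, not taken from the
patch): in portable C the carry-out of each 64-bit add has to be
recovered with an unsigned compare and then added back in, where the
asm can simply chain add-with-carry instructions. The names
add64_with_carry(), fold64() and csum_sketch() are made up for this
example, and the result is the folded, non-inverted sum in host byte
order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Ones' complement add: fold the carry-out back into bit 0. */
static uint64_t add64_with_carry(uint64_t sum, uint64_t val)
{
	sum += val;
	/* No carry flag in C: unsigned wrap-around shows up as sum < val. */
	return sum + (sum < val);
}

/* Fold a 64-bit ones' complement sum down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Hypothetical helper for illustration; this is not the kernel's do_csum(). */
static uint16_t csum_sketch(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0, word;

	while (len >= 8) {
		memcpy(&word, buf, 8);		/* alignment-safe 64-bit load */
		sum = add64_with_carry(sum, word);
		buf += 8;
		len -= 8;
	}
	if (len) {
		word = 0;
		memcpy(&word, buf, len);	/* tail, zero-padded in memory order */
		sum = add64_with_carry(sum, word);
	}
	return fold64(sum);
}
```

Every accumulation step costs an add, a compare and a second add here,
and each step depends on the previous one.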

> The asm looks pretty reasonable to me - instinct says there's *possibly*
> some value for out-of-order cores in doing the 8-way accumulations in a
> more pairwise fashion, but I guess either way the carry flag dependency
> is going to dominate, so it may well be moot.

Yes. In fact, I was surprised the speedup is as dramatic as it is
despite this, but I guess they optimize for this rather well at the
uarch level.

> What may be more
> worthwhile is taking the effort to align the source pointer, at least
> for larger inputs, so as to be kinder to little cores - according to its
> optimisation guide, A55 is fairly sensitive to unaligned loads, so I'd
> assume that's true of its older/smaller friends too. I'll see what I can
> measure in practice - until proven otherwise I'd have no great objection
> to merging this patch as-is if the need is real. Improvements can always
> come later :)
>

Good point re alignment, I didn't consider that at all tbh.
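
For reference, the first step of such an alignment pass would be
working out how many leading bytes to consume before the pointer hits
the next boundary. A minimal sketch, assuming a power-of-two alignment;
head_bytes_to_align() is a made-up name and nothing like it exists in
the patch:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Number of bytes to consume before p reaches the next align-byte
 * boundary. align must be a power of two. Hypothetical helper.
 */
static size_t head_bytes_to_align(const void *p, size_t align)
{
	size_t misalign = (size_t)((uintptr_t)p & (align - 1));

	return (align - misalign) & (align - 1);
}
```

One caveat: if the head count is odd, the bytes in the aligned loads
land in the opposite halves of their 16-bit lanes, so the folded result
needs a byte swap at the end (the property RFC 1071 describes).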

I'll let the maintainers decide whether/when to merge this. I don't
feel strongly either way.
