Re: [PATCH] arm64: do_csum: implement accelerated scalar version

Robin Murphy Thu, 28 Feb 2019 07:15:06 -0800

Hi Ard,

On 28/02/2019 14:16, Ard Biesheuvel wrote:

(+ Catalin)


On Tue, 19 Feb 2019 at 16:08, Ilias Apalodimas
<[email protected]> wrote:


On Tue, Feb 19, 2019 at 12:08:42AM +0100, Ard Biesheuvel wrote:

It turns out that the IP checksumming code is still exercised often,
even though one might expect that modern NICs with checksum offload
have no use for it. However, as Lingyan points out, there are
combinations of features where the network stack may still fall back
to software checksumming, and so it makes sense to provide an
optimized implementation in software as well.

So provide an implementation of do_csum() in scalar assembler, which,
unlike C, gives direct access to the carry flag, making the code run
substantially faster. The routine uses overlapping 64 byte loads for
all input size > 64 bytes, in order to reduce the number of branches
and improve performance on cores with deep pipelines.

On Cortex-A57, this implementation is on par with Lingyan's NEON
implementation, and roughly 7x as fast as the generic C code.

Cc: "huanglingyan (A)" <[email protected]>
Signed-off-by: Ard Biesheuvel <[email protected]>

...


Acked-by: Ilias Apalodimas <[email protected]>


Full patch here

https://lore.kernel.org/linux-arm-kernel/[email protected]/

This was a follow-up to some discussions about Lingyan's NEON code,
CC'ed to netdev@ so people could chime in as to whether we need
accelerated checksumming code in the first place.

FWIW ever since we did ip_fast_csum() I've been meaning to see how well
I can do with a similar tweaked C implementation for this (mostly for
fun). Since I've recently dug out my RK3328 box for other reasons I'll
give this a test - that's a weedy little quad-A53 whose GbE hardware
checksumming is slightly busted and has to be turned off, so the
do_csum() overhead under heavy network load is comparatively massive.
(plus it's non-EFI so I should be able to try big-endian easily too)

The asm looks pretty reasonable to me - instinct says there's *possibly*
some value for out-of-order cores in doing the 8-way accumulations in a
more pairwise fashion, but I guess either way the carry flag dependency
is going to dominate, so it may well be moot. What may be more
worthwhile is taking the effort to align the source pointer, at least
for larger inputs, so as to be kinder to little cores - according to its
optimisation guide, A55 is fairly sensitive to unaligned loads, so I'd
assume that's true of its older/smaller friends too. I'll see what I can
measure in practice - until proven otherwise I'd have no great objection
to merging this patch as-is if the need is real. Improvements can always
come later :)
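
To make the "pairwise" idea concrete, here is a rough C sketch, not the
code from the patch. The names add64_carry(), sum64_pairwise() and
fold64() are made up for illustration. It assumes a 64-byte block is
summed as eight 64-bit words, with the carry out of each addition wrapped
back into bit 0 (which the asm gets from the carry flag), and the adds
arranged as a tree so that independent pairs do not wait on each other.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: 64-bit add with end-around carry. In C the carry
 * has to be detected with a compare; asm can chain adcs instructions.
 */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	return s + (s < b);	/* s < b iff the addition wrapped */
}

/*
 * Hypothetical helper: sum one 64-byte block as a three-level tree
 * instead of a single chain of eight dependent additions.
 */
static uint64_t sum64_pairwise(const unsigned char *p)
{
	uint64_t w[8];

	/* memcpy keeps the loads well-defined for an unaligned source */
	memcpy(w, p, sizeof(w));

	uint64_t s01 = add64_carry(w[0], w[1]);
	uint64_t s23 = add64_carry(w[2], w[3]);
	uint64_t s45 = add64_carry(w[4], w[5]);
	uint64_t s67 = add64_carry(w[6], w[7]);

	return add64_carry(add64_carry(s01, s23), add64_carry(s45, s67));
}

/* Hypothetical helper: fold a 64-bit ones' complement sum to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffffULL) + (sum >> 16);
	sum = (sum & 0xffffULL) + (sum >> 16);

	return (uint16_t)sum;
}
```

Whether the tree shape wins anything over a straight chain is exactly the
open question above: each add64_carry() still carries its own compare and
add, so this only shows the data flow and says nothing about performance.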


Robin.
