On 15/05/2019 11:15, David Laight wrote:
> ...
>>         ptr = (u64 *)(buff - offset);
>>         shift = offset * 8;
>>
>>         /*
>>          * Head: zero out any excess leading bytes. Shifting back by the same
>>          * amount should be at least as fast as any other way of handling the
>>          * odd/even alignment, and means we can ignore it until the very end.
>>          */
>>         data = *ptr++;
>> #ifdef __LITTLE_ENDIAN
>>         data = (data >> shift) << shift;
>> #else
>>         data = (data << shift) >> shift;
>> #endif
>
> I suspect that
> #ifdef __LITTLE_ENDIAN
>         data &= ~0ull << shift;
> #else
>         data &= ~0ull >> shift;
> #endif
> is likely to be better.

Out of interest, better in which respects? For the A64 ISA at least, that would take 3 instructions plus an additional scratch register, e.g.:

        MOV     x2, #~0
        LSL     x2, x2, x1
        AND     x0, x0, x2

(alternatively "AND x0, x0, x2, LSL x1" to save 4 bytes of code, but that will typically take as many cycles, if not more, than just pipelining the two 'simple' ALU instructions)

Whereas the original is just two shift instructions in-place.

        LSR     x0, x0, x1
        LSL     x0, x0, x1
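
For anyone who wants to poke at the codegen themselves, here's a minimal standalone sketch (my own, not from the patch) isolating the two forms; compiling at -O2 for arm64 should reproduce the sequences above. Both assume shift < 64, which holds here since shift = offset * 8 with offset < 8:

        #include <stdint.h>

        /* Shift down and back up: two dependent in-place shifts (LSR; LSL). */
        uint64_t mask_shift(uint64_t data, unsigned int shift)
        {
                return (data >> shift) << shift;
        }

        /* AND with a computed mask: materialising ~0ull << shift needs a
         * scratch register (MOV; LSL; AND). */
        uint64_t mask_and(uint64_t data, unsigned int shift)
        {
                return data & (~0ull << shift);
        }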

If the operation were repeated, the constant generation could certainly be amortised over multiple subsequent ANDs for a net win, but that isn't the case here.
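
To illustrate that last point, a contrived sketch (hypothetical; nothing like this exists in the patch) where the mask form would win, because the MOV+LSL gets hoisted out of the loop:

        #include <stdint.h>
        #include <stddef.h>

        /* Hypothetical repeated-masking variant: the mask is generated once
         * (MOV; LSL) and each iteration then costs a single AND, whereas the
         * shift-back form would cost two dependent shifts per word. */
        uint64_t sum_masked(const uint64_t *data, size_t n, unsigned int shift)
        {
                uint64_t mask = ~0ull << shift; /* assumes shift < 64 */
                uint64_t sum = 0;

                for (size_t i = 0; i < n; i++)
                        sum += data[i] & mask;

                return sum;
        }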

Robin.
