On Tue, Dec 15, 2020 at 03:47:03PM -0800, Eric Biggers wrote:
> This patchset adds a NEON implementation of BLAKE2b for 32-bit ARM.
> Patches 1-4 prepare for it by making some updates to the generic
> implementation, while patch 5 adds the actual NEON implementation.
> 
> On Cortex-A7 (which these days is the most common ARM processor that
> doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
> SHA-256, and slightly faster than SHA-1.  It is also almost three times
> as fast as the generic implementation of BLAKE2b:
> 
>       Algorithm            Cycles per byte (on 4096-byte messages)
>       ===================  =======================================
>       blake2b-256-neon     14.1
>       sha1-neon            16.4
>       sha1-asm             20.8
>       blake2s-256-generic  26.1
>       sha256-neon          28.9
>       sha256-asm           32.1
>       blake2b-256-generic  39.9
> 
> This implementation isn't directly based on any other implementation,
> but it borrows some ideas from previous NEON code I've written as well
> as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
> the other NEON implementations of BLAKE2b I'm aware of (the
> implementation in the BLAKE2 official repository using intrinsics, and
> Andrew Moon's implementation which can be found in SUPERCOP).
> 
> NEON-optimized BLAKE2b is useful because there is interest in using
> BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
> devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
> these devices, the performance cost of upgrading to SHA-256 may be
> unacceptable, whereas BLAKE2b-256 would actually improve performance.
> 
> Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
> is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
> BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
> 64-bit operations, and because BLAKE2s's block size is too small for
> NEON to be helpful for it.  The best I've been able to do with BLAKE2s
> on Cortex-A7 is 19.0 cpb with an optimized scalar implementation.

By the way, if people are interested in having my ARM scalar implementation of
BLAKE2s in the kernel too, I can send a patchset for that too.  It just ended up
being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case
mentioned above.  If it were to be added as "blake2s-256-arm", we'd have:

        Algorithm            Cycles per byte (on 4096-byte messages)
        ===================  =======================================
        blake2b-256-neon     14.1
        sha1-neon            16.4
        blake2s-256-arm      19.0
        sha1-asm             20.8
        blake2s-256-generic  26.1
        sha256-neon          28.9
        sha256-asm           32.1
        blake2b-256-generic  39.9

Reply via email to