On Fri, Nov 06, 2020 at 05:39:38PM +0100, Ard Biesheuvel wrote:
> Based on lessons learnt from optimizing the 32-bit version of this driver,
> we can simplify the arm64 version considerably, by reordering the final
> two stores when the last block is not a multiple of 64 bytes. This removes
> the need to use permutation instructions to calculate the elements that are
> clobbered by the final overlapping store, given that the store of the
> penultimate block now follows it, and that one carries the correct values
> for those elements already.
> 
> While at it, simplify the overlapping loads as well, by calculating the
> address of the final overlapping load upfront, and switching to this
> address for every load that would otherwise extend past the end of the
> source buffer.
> 
> There is no impact on performance, but the resulting code is substantially
> smaller and easier to follow.
> 
> Cc: Eric Biggers <ebigg...@google.com>
> Cc: "Jason A . Donenfeld" <ja...@zx2c4.com>
> Signed-off-by: Ard Biesheuvel <a...@kernel.org>
> ---
>  arch/arm64/crypto/chacha-neon-core.S | 193 +++++++-------------
>  1 file changed, 69 insertions(+), 124 deletions(-)

Patch applied.  Thanks.
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
