[PATCH v2 09/10] crypto: poly1305 - Add a two block SSE2 variant for x86_64

2015-07-16 Thread Martin Willi
Extends the x86_64 SSE2 Poly1305 authenticator by a function processing two consecutive Poly1305 blocks in parallel using a derived key r^2. Loop unrolling can be more effectively mapped to SSE instructions, further increasing throughput. For large messages, throughput increases by ~45-65% compare
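
The derived-key idea in this snippet can be illustrated outside the kernel. Poly1305 updates its accumulator as h = (h + m) * r mod 2^130 - 5 for each block, so two consecutive blocks satisfy ((h + m1) * r + m2) * r = (h + m1) * r^2 + m2 * r, and the two products on the right no longer depend on each other. The following is only a minimal Python sketch of that identity using arbitrary-precision integers; the function names are hypothetical and it does not reflect how the patch lays out data in SSE2 registers.

```python
P = (1 << 130) - 5  # the Poly1305 prime, 2^130 - 5

def clamp(r):
    # Clamp the r half of the key as described in RFC 7539.
    return r & 0x0ffffffc0ffffffc0ffffffc0fffffff

def blocks_of(msg):
    # Split into 16-byte little-endian blocks, each with a 1 byte appended.
    out = []
    for i in range(0, len(msg), 16):
        chunk = msg[i:i + 16]
        out.append(int.from_bytes(chunk, "little") | (1 << (8 * len(chunk))))
    return out

def accumulate_one_block(h, r, blocks):
    # Reference loop: one multiplication by r per block, strictly serial.
    for m in blocks:
        h = (h + m) * r % P
    return h

def accumulate_two_block(h, r, blocks):
    # Two blocks per step using the derived key r^2 (hypothetical helper).
    r2 = r * r % P
    i = 0
    while i + 1 < len(blocks):
        # (h + m1) * r^2 and m2 * r are independent products.
        h = ((h + blocks[i]) * r2 + blocks[i + 1] * r) % P
        i += 2
    if i < len(blocks):
        # An odd trailing block falls back to the single-block step.
        h = (h + blocks[i]) * r % P
    return h
```

Both functions return the same accumulator for any input; only the dependency chain between multiplications differs, which is what makes the two-block form a better fit for SIMD lanes.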

[PATCH v2 10/10] crypto: poly1305 - Add a four block AVX2 variant for x86_64

2015-07-16 Thread Martin Willi
Extends the x86_64 Poly1305 authenticator by a function processing four consecutive Poly1305 blocks in parallel using AVX2 instructions. For large messages, throughput increases by ~15-45% compared to two block SSE2: testing speed of poly1305 (poly1305-simd) test 0 ( 96 byte blocks, 16 bytes

[PATCH v2 05/10] crypto: chacha20 - Add an eight block AVX2 variant for x86_64

2015-07-16 Thread Martin Willi
Extends the x86_64 ChaCha20 implementation by a function processing eight ChaCha20 blocks in parallel using AVX2. For large messages, throughput increases by ~55-70% compared to four block SSSE3: testing speed of chacha20 (chacha20-simd) encryption test 0 (256 bit key, 16 byte blocks): 42249230 o

[PATCH v2 04/10] crypto: chacha20 - Add a four block SSSE3 variant for x86_64

2015-07-16 Thread Martin Willi
Extends the x86_64 SSSE3 ChaCha20 implementation by a function processing four ChaCha20 blocks in parallel. This avoids the word shuffling needed in the single block variant, further increasing throughput. For large messages, throughput increases by ~110% compared to single block SSSE3: testing s

[PATCH v2 08/10] crypto: poly1305 - Add a SSE2 SIMD variant for x86_64

2015-07-16 Thread Martin Willi
Implements an x86_64 assembler driver for the Poly1305 authenticator. This single block variant holds the 130-bit integer in 5 32-bit words, but uses SSE to do two multiplications/additions in parallel. When calling updates with small blocks, the overhead for kernel_fpu_begin()/kernel_fpu_end() neg
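
As a rough illustration of holding a 130-bit integer in five 32-bit words, the sketch below assumes 26-bit limbs (5 x 26 = 130), a common choice in 32-bit Poly1305 implementations; the snippet itself does not state the limb width, and all names here are hypothetical. It shows a schoolbook limb multiplication in which any partial product that would land at or above 2^130 is folded back with a factor of 5, since 2^130 is congruent to 5 modulo 2^130 - 5.

```python
P = (1 << 130) - 5
MASK26 = (1 << 26) - 1

def to_limbs(x):
    # Split a value below 2^130 into five 26-bit limbs, least significant first.
    return [(x >> (26 * i)) & MASK26 for i in range(5)]

def from_limbs(limbs):
    # Recombine limbs; also works if a limb is slightly wider than 26 bits.
    return sum(v << (26 * i) for i, v in enumerate(limbs))

def mul_limbs(h, r):
    # Schoolbook multiply with reduction folded in: 2^130 == 5 (mod P).
    d = [0] * 5
    for i in range(5):
        for j in range(5):
            k = i + j
            if k >= 5:
                d[k - 5] += 5 * h[i] * r[j]  # wrapped term picks up a factor 5
            else:
                d[k] += h[i] * r[j]
    # Propagate carries so every limb is 26 bits wide again.
    out, carry = [], 0
    for k in range(5):
        d[k] += carry
        out.append(d[k] & MASK26)
        carry = d[k] >> 26
    out[0] += carry * 5  # the top carry wraps around to limb 0
    return out  # partially reduced: congruent to h * r modulo P
```

With 26-bit inputs, each column sum stays well below 2^64, which is why this layout suits 32x32 to 64-bit multiply instructions. The result is only partially reduced; a full reduction is needed once, at finalization.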

[PATCH v2 07/10] crypto: poly1305 - Export common Poly1305 helpers

2015-07-16 Thread Martin Willi
As architecture specific drivers need a software fallback, export Poly1305 init/update/final functions together with some helpers in a header file. Signed-off-by: Martin Willi --- crypto/chacha20poly1305.c | 4 +-- crypto/poly1305_generic.c | 73 +++

[PATCH v2 02/10] crypto: chacha20 - Export common ChaCha20 helpers

2015-07-16 Thread Martin Willi
As architecture specific drivers need a software fallback, export a ChaCha20 en-/decryption function together with some helpers in a header file. Signed-off-by: Martin Willi --- crypto/chacha20_generic.c | 28 crypto/chacha20poly1305.c | 3 +-- include/crypto/chacha

[PATCH v2 00/10] crypto: x86_64 - Add SSE/AVX2 ChaCha20/Poly1305 ciphers

2015-07-16 Thread Martin Willi
This patch series adds both ChaCha20 and Poly1305 specific ciphers for x86_64 using SSE2/SSSE3 and AVX2 instructions. The idea is to have a drop-in replacement for AESNI/CLMUL-accelerated AES-GCM providing at least somewhat comparable performance; refer to RFC 7539 for details. It is based on crypto

[PATCH v2 06/10] crypto: testmgr - Add a longer ChaCha20 test vector

2015-07-16 Thread Martin Willi
The AVX2 variant of ChaCha20 is used only for messages with >= 512 bytes length. With the existing test vectors, the implementation could not be tested. Due to the lack of such a long official test vector, this one is self-generated using chacha20-generic. Signed-off-by: Martin Willi --- crypto/te

[PATCH v2 01/10] crypto: tcrypt - Add ChaCha20/Poly1305 speed tests

2015-07-16 Thread Martin Willi
Adds individual ChaCha20 and Poly1305 and a combined rfc7539esp AEAD speed test using mode numbers 214, 321 and 213. For Poly1305 we add a specific speed template, as it expects the key prepended to the input data. Signed-off-by: Martin Willi --- crypto/tcrypt.c | 15 +++ crypto/tcry

[PATCH v2 03/10] crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64

2015-07-16 Thread Martin Willi
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This single block variant works on a single state matrix using SSE instructions. It requires SSSE3 due to the use of pshufb for efficient 8/16-bit rotate operations. For large messages, throughput increases by ~65% compared to chac
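
The reason a byte shuffle can stand in for the 8- and 16-bit rotates is that rotating a 32-bit word by a multiple of 8 only permutes its bytes. The sketch below models this in Python on a single word; the index tables are an illustration for one little-endian 32-bit lane and are not the shuffle masks used in the patch.

```python
def rotl32(x, n):
    # Plain 32-bit left rotate using shifts, as a reference.
    return ((x << n) | (x >> (32 - n))) & 0xffffffff

def shuffle_bytes(x, idx):
    # Model of a per-lane byte shuffle: output byte i is input byte idx[i].
    src = x.to_bytes(4, "little")
    return int.from_bytes(bytes(src[i] for i in idx), "little")

# Byte permutations equivalent to left rotates (little-endian byte order).
ROT16 = (2, 3, 0, 1)
ROT8 = (3, 0, 1, 2)

x = 0x12345678
print(hex(shuffle_bytes(x, ROT16)))  # 0x56781234
print(hex(shuffle_bytes(x, ROT8)))   # 0x34567812
```

The other two ChaCha20 rotate amounts, 12 and 7 bits, are not byte-aligned, so they cannot be expressed as a byte permutation and need a shift-and-combine sequence instead.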

Re: crypto: chacha20poly1305 - Convert to new AEAD interface

2015-07-16 Thread Martin Willi
Herbert, > This patch converts rfc7539 and rfc7539esp to the new AEAD interface. > The test vectors for rfc7539esp have also been updated to include > the IV. Thanks for taking care of it, I haven't found the time yet to do it myself. I can confirm that it works fine under IPsec load, so you may