This patch introduces a "by8" AES CTR mode AVX optimization inspired by
the Intel Optimized IPSEC Cryptographic library. For additional
information, please see:
http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=22972

The functions aes_ctr_enc_128_avx_by8(), aes_ctr_enc_192_avx_by8() and
aes_ctr_enc_256_avx_by8() are adapted from the Intel Optimized IPSEC
Cryptographic library. When both the AES and AVX features are enabled on
a platform, the glue code in the AESNI module overrides the existing
"by4" CTR mode en/decryption with the "by8" AES CTR mode en/decryption.
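
The override itself is a plain function-pointer switch in the glue code;
condensed from the glue-code hunks below (the pr_info() and unrelated
setup are elided), the pattern is:

	static void (*aesni_ctr_enc_tfm)(struct crypto_aes_ctx *ctx, u8 *out,
					 const u8 *in, unsigned int len, u8 *iv);

	static int __init aesni_init(void)
	{
		/* default to the existing "by4" routine */
		aesni_ctr_enc_tfm = aesni_ctr_enc;
	#if defined(CONFIG_AS_AVX)
		/* both AES-NI and AVX available: switch to "by8" */
		if (boot_cpu_has(X86_FEATURE_AES) && boot_cpu_has(X86_FEATURE_AVX))
			aesni_ctr_enc_tfm = aesni_ctr_enc_avx_tfm;
	#endif
		...
	}

ctr_crypt() then calls through aesni_ctr_enc_tfm rather than calling
aesni_ctr_enc directly, so the selection is made once at module init.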

On a Haswell desktop, with turbo disabled and all CPUs running at
maximum frequency, the "by8" CTR mode optimization shows better
performance results across data and key sizes, as measured by tcrypt.

The average performance improvement of the "by8" version over the "by4"
version is as follows:

For 128-bit keys and data sizes >= 256 bytes, there is a 10-16% improvement.
For 192-bit keys and data sizes >= 256 bytes, there is a 20-22% improvement.
For 256-bit keys and data sizes >= 256 bytes, there is a 20-25% improvement.
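
The numbers below come from tcrypt's cycle-count speed tests; assuming
the usual tcrypt module options, a comparable run can be started with
something like:

	modprobe tcrypt mode=200

(mode 200 covers the AES cipher speed tests, including ctr(aes); with
the default sec=0 the results are reported in processor cycles, as in
the logs that follow).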

A typical tcrypt run of AES CTR mode encryption with the "by4" and "by8"
optimizations shows the following results:

tcrypt with "by4" AES CTR mode encryption optimization on a Haswell Desktop:
---------------------------------------------------------------------------

testing speed of __ctr-aes-aesni encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 343 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 336 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 491 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1130 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 7309 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 346 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 361 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 543 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1321 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 9649 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 369 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 366 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 595 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1531 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 10522 cycles (8192 bytes)

testing speed of __ctr-aes-aesni decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 336 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 350 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 487 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1129 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 7287 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 350 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 359 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 635 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1324 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 9595 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 364 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 377 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 604 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1527 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 10549 cycles (8192 bytes)

tcrypt with "by8" AES CTR mode encryption optimization on a Haswell Desktop:
---------------------------------------------------------------------------

testing speed of __ctr-aes-aesni encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 340 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 330 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 450 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1043 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 6597 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 339 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 352 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 539 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1153 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 8458 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 353 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 360 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 512 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1277 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 8745 cycles (8192 bytes)

testing speed of __ctr-aes-aesni decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 348 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 335 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 451 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1030 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 6611 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 354 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 346 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 488 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1154 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 8390 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 357 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 362 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 515 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1284 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 8681 cycles (8192 bytes)

Signed-off-by: Chandramouli Narayanan <mo...@linux.intel.com>
---
 arch/x86/crypto/Makefile                |   2 +-
 arch/x86/crypto/aes_ctrby8_avx-x86_64.S | 545 ++++++++++++++++++++++++++++++++
 arch/x86/crypto/aesni-intel_glue.c      |  41 ++-
 3 files changed, 586 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/crypto/aes_ctrby8_avx-x86_64.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 61d6e28..f6fe1e2 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -76,7 +76,7 @@ ifeq ($(avx2_supported),yes)
 endif
 
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o
-aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o
+aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o
 ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
 sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
 ifeq ($(avx2_supported),yes)
diff --git a/arch/x86/crypto/aes_ctrby8_avx-x86_64.S b/arch/x86/crypto/aes_ctrby8_avx-x86_64.S
new file mode 100644
index 0000000..e49595f
--- /dev/null
+++ b/arch/x86/crypto/aes_ctrby8_avx-x86_64.S
@@ -0,0 +1,545 @@
+/*
+ *     Implement AES CTR mode by8 optimization with AVX instructions. (x86_64)
+ *
+ * This is AES128/192/256 CTR mode optimization implementation. It requires
+ * the support of Intel(R) AESNI and AVX instructions.
+ *
+ * This work was inspired by the AES CTR mode optimization published
+ * in the Intel Optimized IPSEC Cryptographic library.
+ * Additional information on it can be found at:
+ *    http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=22972
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2014 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Contact Information:
+ * James Guilford <james.guilf...@intel.com>
+ * Sean Gulley <sean.m.gul...@intel.com>
+ * Chandramouli Narayanan <mo...@linux.intel.com>
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2014 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include <linux/linkage.h>
+#include <asm/inst.h>
+
+#define CONCAT(a,b)    a##b
+#define VMOVDQ         vmovdqu
+
+#define xdata0         %xmm0
+#define xdata1         %xmm1
+#define xdata2         %xmm2
+#define xdata3         %xmm3
+#define xdata4         %xmm4
+#define xdata5         %xmm5
+#define xdata6         %xmm6
+#define xdata7         %xmm7
+#define xcounter       %xmm8
+#define xbyteswap      %xmm9
+#define xkey0          %xmm10
+#define xkey3          %xmm11
+#define xkey6          %xmm12
+#define xkey9          %xmm13
+#define xkey4          %xmm11
+#define xkey8          %xmm12
+#define xkey12         %xmm13
+#define xkeyA          %xmm14
+#define xkeyB          %xmm15
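+
+/*
+ * Note the register aliasing above: xkey3/xkey4, xkey6/xkey8 and
+ * xkey9/xkey12 share %xmm11-%xmm13.  The 128-bit key schedule caches
+ * round keys 0/3/6/9 across iterations, while the 192/256-bit
+ * schedules cache round keys 0/4/8/12 (see .Lmult_of_8_blks below).
+ */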
+
+#define p_in           %rdi
+#define p_iv           %rsi
+#define p_keys         %rdx
+#define p_out          %rcx
+#define num_bytes      %r8
+
+#define tmp            %r10
+#define        DDQ(i)          CONCAT(ddq_add_,i)
+#define        XMM(i)          CONCAT(%xmm, i)
+#define        DDQ_DATA        0
+#define        XDATA           1
+#define KEY_128                1
+#define KEY_192                2
+#define KEY_256                3
+
+.section .data
+.align 16
+
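+/*
+ * byteswap_const is a vpshufb mask that reverses the byte order of a
+ * 128-bit word: CTR mode's counter is big-endian on the wire, while
+ * the counter arithmetic below (vpaddd) is done in little-endian form.
+ * ddq_add_1..ddq_add_8 are the counter increments for blocks 1-8 of a
+ * parallel batch.
+ */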
+byteswap_const:
+       .octa 0x000102030405060708090A0B0C0D0E0F
+ddq_add_1:
+       .octa 0x00000000000000000000000000000001
+ddq_add_2:
+       .octa 0x00000000000000000000000000000002
+ddq_add_3:
+       .octa 0x00000000000000000000000000000003
+ddq_add_4:
+       .octa 0x00000000000000000000000000000004
+ddq_add_5:
+       .octa 0x00000000000000000000000000000005
+ddq_add_6:
+       .octa 0x00000000000000000000000000000006
+ddq_add_7:
+       .octa 0x00000000000000000000000000000007
+ddq_add_8:
+       .octa 0x00000000000000000000000000000008
+
+.text
+
+/* generate a unique variable for ddq_add_x */
+
+.macro setddq n
+       var_ddq_add = DDQ(\n)
+.endm
+
+/* generate a unique variable for xmm register */
+.macro setxdata n
+       var_xdata = XMM(\n)
+.endm
+
+/* concatenate the numeric 'id' with the symbol 'name' to select ddq_add_<id> or %xmm<id> */
+
+.macro club name, id
+.altmacro
+       .if \name == DDQ_DATA
+               setddq %\id
+       .elseif \name == XDATA
+               setxdata %\id
+       .endif
+.noaltmacro
+.endm
+
+/*
+ * do_aes num_in_par load_keys key_len
+ *
+ * Encrypt 'num_in_par' counter blocks and XOR them with the input to
+ * produce the output.  If 'load_keys' is set, (re)load the cached
+ * round-key registers.  This increments p_in, but not p_out.
+ */
+.macro do_aes b, k, key_len
+       .set by, \b
+       .set load_keys, \k
+       .set klen, \key_len
+
+       .if (load_keys)
+               vmovdqa 0*16(p_keys), xkey0
+       .endif
+
+       vpshufb xbyteswap, xcounter, xdata0
+
+       .set i, 1
+       .rept (by - 1)
+               club DDQ_DATA, i
+               club XDATA, i
+               vpaddd  var_ddq_add(%rip), xcounter, var_xdata
+               vpshufb xbyteswap, var_xdata, var_xdata
+               .set i, (i +1)
+       .endr
+
+       vmovdqa 1*16(p_keys), xkeyA
+
+       vpxor   xkey0, xdata0, xdata0
+       club DDQ_DATA, by
+       vpaddd  var_ddq_add(%rip), xcounter, xcounter
+
+       .set i, 1
+       .rept (by - 1)
+               club XDATA, i
+               vpxor   xkey0, var_xdata, var_xdata
+               .set i, (i +1)
+       .endr
+
+       vmovdqa 2*16(p_keys), xkeyB
+
+       .set i, 0
+       .rept by
+               club XDATA, i
+               vaesenc xkeyA, var_xdata, var_xdata             /* key 1 */
+               .set i, (i +1)
+       .endr
+
+       .if (klen == KEY_128)
+               .if (load_keys)
+                       vmovdqa 3*16(p_keys), xkeyA
+               .endif
+       .else
+               vmovdqa 3*16(p_keys), xkeyA
+       .endif
+
+       .set i, 0
+       .rept by
+               club XDATA, i
+               vaesenc xkeyB, var_xdata, var_xdata             /* key 2 */
+               .set i, (i +1)
+       .endr
+
+       add     $(16*by), p_in
+
+       .if (klen == KEY_128)
+               vmovdqa 4*16(p_keys), xkey4
+       .else
+               .if (load_keys)
+                       vmovdqa 4*16(p_keys), xkey4
+               .endif
+       .endif
+
+       .set i, 0
+       .rept by
+               club XDATA, i
+               vaesenc xkeyA, var_xdata, var_xdata             /* key 3 */
+               .set i, (i +1)
+       .endr
+
+       vmovdqa 5*16(p_keys), xkeyA
+
+       .set i, 0
+       .rept by
+               club XDATA, i
+               vaesenc xkey4, var_xdata, var_xdata             /* key 4 */
+               .set i, (i +1)
+       .endr
+
+       .if (klen == KEY_128)
+               .if (load_keys)
+                       vmovdqa 6*16(p_keys), xkeyB
+               .endif
+       .else
+               vmovdqa 6*16(p_keys), xkeyB
+       .endif
+
+       .set i, 0
+       .rept by
+               club XDATA, i
+               vaesenc xkeyA, var_xdata, var_xdata             /* key 5 */
+               .set i, (i +1)
+       .endr
+
+       vmovdqa 7*16(p_keys), xkeyA
+
+       .set i, 0
+       .rept by
+               club XDATA, i
+               vaesenc xkeyB, var_xdata, var_xdata             /* key 6 */
+               .set i, (i +1)
+       .endr
+
+       .if (klen == KEY_128)
+               vmovdqa 8*16(p_keys), xkey8
+       .else
+               .if (load_keys)
+                       vmovdqa 8*16(p_keys), xkey8
+               .endif
+       .endif
+
+       .set i, 0
+       .rept by
+               club XDATA, i
+               vaesenc xkeyA, var_xdata, var_xdata             /* key 7 */
+               .set i, (i +1)
+       .endr
+
+       .if (klen == KEY_128)
+               .if (load_keys)
+                       vmovdqa 9*16(p_keys), xkeyA
+               .endif
+       .else
+               vmovdqa 9*16(p_keys), xkeyA
+       .endif
+
+       .set i, 0
+       .rept by
+               club XDATA, i
+               vaesenc xkey8, var_xdata, var_xdata             /* key 8 */
+               .set i, (i +1)
+       .endr
+
+       vmovdqa 10*16(p_keys), xkeyB
+
+       .set i, 0
+       .rept by
+               club XDATA, i
+               vaesenc xkeyA, var_xdata, var_xdata             /* key 9 */
+               .set i, (i +1)
+       .endr
+
+       .if (klen != KEY_128)
+               vmovdqa 11*16(p_keys), xkeyA
+       .endif
+
+       .set i, 0
+       .rept by
+               club XDATA, i
+               /* key 10 */
+               .if (klen == KEY_128)
+                       vaesenclast     xkeyB, var_xdata, var_xdata
+               .else
+                       vaesenc xkeyB, var_xdata, var_xdata
+               .endif
+               .set i, (i +1)
+       .endr
+
+       .if (klen != KEY_128)
+               .if (load_keys)
+                       vmovdqa 12*16(p_keys), xkey12
+               .endif
+
+               .set i, 0
+               .rept by
+                       club XDATA, i
+                       vaesenc xkeyA, var_xdata, var_xdata     /* key 11 */
+                       .set i, (i +1)
+               .endr
+
+               .if (klen == KEY_256)
+                       vmovdqa 13*16(p_keys), xkeyA
+               .endif
+
+               .set i, 0
+               .rept by
+                       club XDATA, i
+                       .if (klen == KEY_256)
+                               /* key 12 */
+                               vaesenc xkey12, var_xdata, var_xdata
+                       .else
+                               vaesenclast xkey12, var_xdata, var_xdata
+                       .endif
+                       .set i, (i +1)
+               .endr
+
+               .if (klen == KEY_256)
+                       vmovdqa 14*16(p_keys), xkeyB
+
+                       .set i, 0
+                       .rept by
+                               club XDATA, i
+                               /* key 13 */
+                               vaesenc xkeyA, var_xdata, var_xdata
+                               .set i, (i +1)
+                       .endr
+
+                       .set i, 0
+                       .rept by
+                               club XDATA, i
+                               /* key 14 */
+                               vaesenclast     xkeyB, var_xdata, var_xdata
+                               .set i, (i +1)
+                       .endr
+               .endif
+       .endif
+
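+       /*
+        * XOR the encrypted counter blocks with the input.  p_in was
+        * already advanced above, hence the negative (i*16 - 16*by)
+        * offsets; xkeyA/xkeyB are reused as scratch registers here.
+        */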
+       .set i, 0
+       .rept (by / 2)
+               .set j, (i+1)
+               VMOVDQ  (i*16 - 16*by)(p_in), xkeyA
+               VMOVDQ  (j*16 - 16*by)(p_in), xkeyB
+               club XDATA, i
+               vpxor   xkeyA, var_xdata, var_xdata
+               club XDATA, j
+               vpxor   xkeyB, var_xdata, var_xdata
+               .set i, (i+2)
+       .endr
+
+       .if (i < by)
+               VMOVDQ  (i*16 - 16*by)(p_in), xkeyA
+               club XDATA, i
+               vpxor   xkeyA, var_xdata, var_xdata
+       .endif
+
+       .set i, 0
+       .rept by
+               club XDATA, i
+               VMOVDQ  var_xdata, i*16(p_out)
+               .set i, (i+1)
+       .endr
+.endm
+
+.macro do_aes_load val, key_len
+       do_aes \val, 1, \key_len
+.endm
+
+.macro do_aes_noload val, key_len
+       do_aes \val, 0, \key_len
+.endm
+
+/*
+ * main body of the AES CTR "by8" routines: handle the 1-7 block
+ * remainder first, then loop over full 8-block chunks
+ */
+
+.macro do_aes_ctrmain key_len
+
+       cmp     $16, num_bytes
+       jb      .Ldo_return2\key_len
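+       /*
+        * Fewer than 16 bytes: nothing to do here.  The glue code only
+        * passes multiples of the block size and handles any partial
+        * tail block itself via ctr_crypt_final().
+        */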
+
+       vmovdqa byteswap_const(%rip), xbyteswap
+       vmovdqu (p_iv), xcounter
+       vpshufb xbyteswap, xcounter, xcounter
+
+       mov     num_bytes, tmp
+       and     $(7*16), tmp
+       jz      .Lmult_of_8_blks\key_len
+
+       /* tmp holds the remainder: 1 to 7 blocks (16 to 112 bytes) */
+       cmp     $(4*16), tmp
+       jg      .Lgt4\key_len
+       je      .Leq4\key_len
+
+.Llt4\key_len:
+       cmp     $(2*16), tmp
+       jg      .Leq3\key_len
+       je      .Leq2\key_len
+
+.Leq1\key_len:
+       do_aes_load     1, \key_len
+       add     $(1*16), p_out
+       and     $(~7*16), num_bytes
+       jz      .Ldo_return2\key_len
+       jmp     .Lmain_loop2\key_len
+
+.Leq2\key_len:
+       do_aes_load     2, \key_len
+       add     $(2*16), p_out
+       and     $(~7*16), num_bytes
+       jz      .Ldo_return2\key_len
+       jmp     .Lmain_loop2\key_len
+
+
+.Leq3\key_len:
+       do_aes_load     3, \key_len
+       add     $(3*16), p_out
+       and     $(~7*16), num_bytes
+       jz      .Ldo_return2\key_len
+       jmp     .Lmain_loop2\key_len
+
+.Leq4\key_len:
+       do_aes_load     4, \key_len
+       add     $(4*16), p_out
+       and     $(~7*16), num_bytes
+       jz      .Ldo_return2\key_len
+       jmp     .Lmain_loop2\key_len
+
+.Lgt4\key_len:
+       cmp     $(6*16), tmp
+       jg      .Leq7\key_len
+       je      .Leq6\key_len
+
+.Leq5\key_len:
+       do_aes_load     5, \key_len
+       add     $(5*16), p_out
+       and     $(~7*16), num_bytes
+       jz      .Ldo_return2\key_len
+       jmp     .Lmain_loop2\key_len
+
+.Leq6\key_len:
+       do_aes_load     6, \key_len
+       add     $(6*16), p_out
+       and     $(~7*16), num_bytes
+       jz      .Ldo_return2\key_len
+       jmp     .Lmain_loop2\key_len
+
+.Leq7\key_len:
+       do_aes_load     7, \key_len
+       add     $(7*16), p_out
+       and     $(~7*16), num_bytes
+       jz      .Ldo_return2\key_len
+       jmp     .Lmain_loop2\key_len
+
+.Lmult_of_8_blks\key_len:
+       .if (\key_len != KEY_128)
+               vmovdqa 0*16(p_keys), xkey0
+               vmovdqa 4*16(p_keys), xkey4
+               vmovdqa 8*16(p_keys), xkey8
+               vmovdqa 12*16(p_keys), xkey12
+       .else
+               vmovdqa 0*16(p_keys), xkey0
+               vmovdqa 3*16(p_keys), xkey4
+               vmovdqa 6*16(p_keys), xkey8
+               vmovdqa 9*16(p_keys), xkey12
+       .endif
+.Lmain_loop2\key_len:
+       /* num_bytes is a multiple of 8 blocks (128 bytes) and > 0 */
+       do_aes_noload   8, \key_len
+       add     $(8*16), p_out
+       sub     $(8*16), num_bytes
+       jne     .Lmain_loop2\key_len
+
+.Ldo_return2\key_len:
+       /* return updated IV */
+       vpshufb xbyteswap, xcounter, xcounter
+       vmovdqu xcounter, (p_iv)
+       ret
+.endm
+
+/*
+ * routine to do AES128 CTR enc/decrypt "by8"
+ * XMM registers are clobbered.
+ * Saving/restoring must be done at a higher level
+ * aes_ctr_enc_128_avx_by8(void *in, void *iv, void *keys, void *out,
+ *                     unsigned int num_bytes)
+ */
+ENTRY(aes_ctr_enc_128_avx_by8)
+       /* call the aes main loop */
+       do_aes_ctrmain KEY_128
+
+ENDPROC(aes_ctr_enc_128_avx_by8)
+
+/*
+ * routine to do AES192 CTR enc/decrypt "by8"
+ * XMM registers are clobbered.
+ * Saving/restoring must be done at a higher level
+ * aes_ctr_enc_192_avx_by8(void *in, void *iv, void *keys, void *out,
+ *                     unsigned int num_bytes)
+ */
+ENTRY(aes_ctr_enc_192_avx_by8)
+       /* call the aes main loop */
+       do_aes_ctrmain KEY_192
+
+ENDPROC(aes_ctr_enc_192_avx_by8)
+
+/*
+ * routine to do AES256 CTR enc/decrypt "by8"
+ * XMM registers are clobbered.
+ * Saving/restoring must be done at a higher level
+ * aes_ctr_enc_256_avx_by8(void *in, void *iv, void *keys, void *out,
+ *                     unsigned int num_bytes)
+ */
+ENTRY(aes_ctr_enc_256_avx_by8)
+       /* call the aes main loop */
+       do_aes_ctrmain KEY_256
+
+ENDPROC(aes_ctr_enc_256_avx_by8)
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 948ad0e..b06e20f 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -105,6 +105,9 @@ void crypto_fpu_exit(void);
 #define AVX_GEN4_OPTSIZE 4096
 
 #ifdef CONFIG_X86_64
+
+static void (*aesni_ctr_enc_tfm)(struct crypto_aes_ctx *ctx, u8 *out,
+                             const u8 *in, unsigned int len, u8 *iv);
 asmlinkage void aesni_ctr_enc(struct crypto_aes_ctx *ctx, u8 *out,
                              const u8 *in, unsigned int len, u8 *iv);
 
@@ -154,6 +157,15 @@ asmlinkage void aesni_gcm_dec(void *ctx, u8 *out,
                        u8 *auth_tag, unsigned long auth_tag_len);
 

+#if defined(CONFIG_AS_AVX)
+asmlinkage void aes_ctr_enc_128_avx_by8(const u8 *in, u8 *iv,
+               void *keys, u8 *out, unsigned int num_bytes);
+asmlinkage void aes_ctr_enc_192_avx_by8(const u8 *in, u8 *iv,
+               void *keys, u8 *out, unsigned int num_bytes);
+asmlinkage void aes_ctr_enc_256_avx_by8(const u8 *in, u8 *iv,
+               void *keys, u8 *out, unsigned int num_bytes);
+#endif
+
 #ifdef CONFIG_AS_AVX
 /*
  * asmlinkage void aesni_gcm_precomp_avx_gen2()
@@ -472,6 +484,25 @@ static void ctr_crypt_final(struct crypto_aes_ctx *ctx,
        crypto_inc(ctrblk, AES_BLOCK_SIZE);
 }
 
+#if defined(CONFIG_AS_AVX)
+static void aesni_ctr_enc_avx_tfm(struct crypto_aes_ctx *ctx, u8 *out,
+                             const u8 *in, unsigned int len, u8 *iv)
+{
+       /*
+        * Based on the key length, use the "by8" version of CTR mode
+        * encryption/decryption for improved performance.
+        */
+       if (ctx->key_length == AES_KEYSIZE_128)
+               aes_ctr_enc_128_avx_by8(in, iv, (void *)ctx, out, len);
+       else if (ctx->key_length == AES_KEYSIZE_192)
+               aes_ctr_enc_192_avx_by8(in, iv, (void *)ctx, out, len);
+       else if (ctx->key_length == AES_KEYSIZE_256)
+               aes_ctr_enc_256_avx_by8(in, iv, (void *)ctx, out, len);
+       else
+               aesni_ctr_enc(ctx, out, in, len, iv);
+}
+#endif
+
 static int ctr_crypt(struct blkcipher_desc *desc,
                     struct scatterlist *dst, struct scatterlist *src,
                     unsigned int nbytes)
@@ -486,7 +517,7 @@ static int ctr_crypt(struct blkcipher_desc *desc,
 
        kernel_fpu_begin();
        while ((nbytes = walk.nbytes) >= AES_BLOCK_SIZE) {
-               aesni_ctr_enc(ctx, walk.dst.virt.addr, walk.src.virt.addr,
+               aesni_ctr_enc_tfm(ctx, walk.dst.virt.addr, walk.src.virt.addr,
                              nbytes & AES_BLOCK_MASK, walk.iv);
                nbytes &= AES_BLOCK_SIZE - 1;
                err = blkcipher_walk_done(desc, &walk, nbytes);
@@ -1493,6 +1524,14 @@ static int __init aesni_init(void)
                aesni_gcm_enc_tfm = aesni_gcm_enc;
                aesni_gcm_dec_tfm = aesni_gcm_dec;
        }
+       aesni_ctr_enc_tfm = aesni_ctr_enc;
+#if defined(CONFIG_AS_AVX)
+       if (boot_cpu_has(X86_FEATURE_AES) && boot_cpu_has(X86_FEATURE_AVX)) {
+               /* optimize performance of ctr mode encryption transform */
+               aesni_ctr_enc_tfm = aesni_ctr_enc_avx_tfm;
+               pr_info("AES CTR mode optimization enabled\n");
+       }
+#endif
 #endif
 
        err = crypto_fpu_init();
-- 
1.8.2.1

