https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #9)
> For arithmetic >> (element_precision - 1) one can just use
> {,v}pxor + {,v}pcmpgtq, as in instead of return vec >> 63; do return vec < 0;
> (in C++-ish way), aka VEC_COND_EXPR vec < 0, { all ones }, { 0 }
> For other arithmetic shifts by scalar constant, perhaps one can replace
> return vec >> 17; with return (vectype) ((uvectype) vec >> 17) | ((vec < 0)
> << (64 - 17));
> - it will actually work even for non-constant scalar shift amounts because
> {,v}psllq treats shift counts > 63 as 0.

OK, so that yields

poly_double_le2:
.LFB0:
        .cfi_startproc
        vmovdqu (%rsi), %xmm0
        vpxor   %xmm1, %xmm1, %xmm1
        vpalignr        $8, %xmm0, %xmm0, %xmm2
        vpcmpgtq        %xmm2, %xmm1, %xmm1
        vpand   .LC0(%rip), %xmm1, %xmm1
        vpsllq  $1, %xmm0, %xmm0
        vpxor   %xmm1, %xmm0, %xmm0
        vmovdqu %xmm0, (%rdi)
        ret

when I feed the following to SLP2 directly:

void __GIMPLE (ssa,guessed_local(1073741824),startwith("slp"))
poly_double_le2 (unsigned char * out, const unsigned char * in)
{
  long unsigned int carry;
  long unsigned int _1;
  long unsigned int _2;
  long unsigned int _3;
  long unsigned int _4;
  long unsigned int _5;
  long unsigned int _6;
  __int128 unsigned _9;
  long unsigned int _14;
  long unsigned int _15;
  long int _18;
  long int _19;
  long unsigned int _20;

  __BB(2,guessed_local(1073741824)):
  _9 = __MEM <__int128 unsigned, 8> ((char *)in_8(D));
  _14 = __BIT_FIELD_REF <long unsigned int> (_9, 64u, 64u);
  _18 = (long int) _14;
  _1 = _18 < 0l ? _Literal (unsigned long) -1ul : 0ul;
  carry_10 = _1 & 135ul;
  _2 = _14 << 1;
  _15 = __BIT_FIELD_REF <long unsigned int> (_9, 64u, 0u);
  _19 = (long int) _15;
  _20 = _19 < 0l ? _Literal (unsigned long) -1ul : 0ul;
  _3 = _20 & 1ul;
  _4 = _2 ^ _3;
  _5 = _15 << 1;
  _6 = _5 ^ carry_10;
  __MEM <long unsigned int, 8> ((char *)out_11(D)) = _6;
  __MEM <long unsigned int, 8> ((char *)out_11(D) + _Literal (char *) 8) = _4;
  return;
}

with

  <bb 2> [local count: 1073741824]:
  _9 = MEM <__int128 unsigned> [(char *)in_8(D)];
  _12 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(_9);
  _7 = VEC_PERM_EXPR <_12, _12, { 1, 0 }>;
  vect__18.1_25 = VIEW_CONVERT_EXPR<vector(2) long int>(_7);
  vect_carry_10.3_28 = .VCOND (vect__18.1_25, { 0, 0 }, { 135, 1 }, { 0, 0 }, 108);
  vect__5.0_13 = _12 << 1;
  vect__6.4_29 = vect__5.0_13 ^ vect_carry_10.3_28;
  MEM <vector(2) long unsigned int> [(char *)out_11(D)] = vect__6.4_29;
  return;

in .optimized

The data latency is at least 7 instructions that way, compared to 4 in the
non-vectorized code (guess I could try Intel IACA on it).  So if that's indeed
the best we can do, then it's not profitable (btw, with the above the
vectorizer's conclusion is also "not profitable", but that's due to excessive
costing of the constants for the condition vectorization).

A simple asm replacement of the kernel results in

AES-128/XTS 292740 key schedule/sec; 0.00 ms/op 11571 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 765.571 MiB/sec 4.62 cycles/byte
(382.79 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 767.064 MiB/sec 4.61 cycles/byte
(382.79 MiB in 499.03 ms)

compared to

AES-128/XTS 283527 key schedule/sec; 0.00 ms/op 11932 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 768.446 MiB/sec 4.60 cycles/byte
(384.22 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 769.292 MiB/sec 4.60 cycles/byte
(384.22 MiB in 499.45 ms)

so that's indeed no improvement.  Bigger block sizes also contain vector code,
but that isn't exercised by the botan speed measurement.
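
For illustration, a minimal sketch of the rewrite suggested in comment #9,
using GCC's generic vector extensions (the type and function names below are
made up for this example; it assumes 2x64-bit vectors and 1 <= n <= 63 at the
C level):

typedef long long v2di __attribute__((vector_size (16)));
typedef unsigned long long v2du __attribute__((vector_size (16)));

/* vec >> 63 (arithmetic): compare against zero, giving all-ones where the
   element is negative.  Per comment #9 this can expand to
   {,v}pxor + {,v}pcmpgtq.  */
v2di
ashr63 (v2di vec)
{
  return vec < 0;
}

/* vec >> n (arithmetic) without a 64-bit arithmetic shift instruction:
   logical shift right, then OR in the sign mask shifted into the vacated
   high bits.  At the instruction level {,v}psllq treats counts > 63 as 0,
   which is why comment #9 notes the trick also works for non-constant
   shift amounts.  */
v2di
ashr (v2di vec, int n)
{
  v2du lo = (v2du) vec >> n;
  v2du sign = (v2du) (vec < 0) << (64 - n);
  return (v2di) (lo | sign);
}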