https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #9)
> For arithmetic >> (element_precision - 1) one can just use
> {,v}pxor + {,v}pcmpgtq, i.e. instead of return vec >> 63; do return vec < 0;
> (in a C++-ish way), aka VEC_COND_EXPR vec < 0, { all ones }, { 0 }
> For other arithmetic shifts by a scalar constant, perhaps one can replace
> return vec >> 17; with return (vectype) ((uvectype) vec >> 17) | ((vec < 0)
> << (64 - 17));
> - it will actually work even for non-constant scalar shift amounts because
> {,v}psllq treats shift counts > 63 as 0.
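
A minimal sketch of that rewrite in GNU C vector extensions (the typedefs,
function names and target assumptions are mine, not from the report):

typedef long long v2di __attribute__((vector_size (16)));
typedef unsigned long long v2du __attribute__((vector_size (16)));

/* vec >> 63 (arithmetic) expressed as a comparison, which can be emitted
   as {,v}pxor + {,v}pcmpgtq when pcmpgtq is available.  */
v2di
ashr63 (v2di vec)
{
  return vec < (v2di) { 0, 0 };
}

/* vec >> 17 (arithmetic) as a logical shift plus the sign mask shifted
   into the vacated bits, following the formula quoted above.  */
v2di
ashr17 (v2di vec)
{
  v2di sign = vec < (v2di) { 0, 0 };
  return (v2di) ((v2du) vec >> 17) | (v2di) ((v2du) sign << (64 - 17));
}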

OK, so that yields

poly_double_le2:
.LFB0:
        .cfi_startproc
        vmovdqu (%rsi), %xmm0
        vpxor   %xmm1, %xmm1, %xmm1
        vpalignr        $8, %xmm0, %xmm0, %xmm2
        vpcmpgtq        %xmm2, %xmm1, %xmm1
        vpand   .LC0(%rip), %xmm1, %xmm1
        vpsllq  $1, %xmm0, %xmm0
        vpxor   %xmm1, %xmm0, %xmm0
        vmovdqu %xmm0, (%rdi)
        ret

when I feed the following to SLP2 directly:

void __GIMPLE (ssa,guessed_local(1073741824),startwith("slp"))
poly_double_le2 (unsigned char * out, const unsigned char * in)
{
  long unsigned int carry;
  long unsigned int _1;
  long unsigned int _2;
  long unsigned int _3;
  long unsigned int _4;
  long unsigned int _5;
  long unsigned int _6;
  __int128 unsigned _9;
  long unsigned int _14;
  long unsigned int _15;
  long int _18;
  long int _19;
  long unsigned int _20;

  __BB(2,guessed_local(1073741824)):
  _9 = __MEM <__int128 unsigned, 8> ((char *)in_8(D));
  _14 = __BIT_FIELD_REF <long unsigned int> (_9, 64u, 64u);
  _18 = (long int) _14;
  _1 = _18 < 0l ? _Literal (unsigned long) -1ul : 0ul;
  carry_10 = _1 & 135ul;
  _2 = _14 << 1;
  _15 = __BIT_FIELD_REF <long unsigned int> (_9, 64u, 0u);
  _19 = (long int) _15;
  _20 = _19 < 0l ? _Literal (unsigned long) -1ul : 0ul;
  _3 = _20 & 1ul;
  _4 = _2 ^ _3;
  _5 = _15 << 1;
  _6 = _5 ^ carry_10;
  __MEM <long unsigned int, 8> ((char *)out_11(D)) = _6;
  __MEM <long unsigned int, 8> ((char *)out_11(D) + _Literal (char *) 8) = _4;
  return;

}

which ends up as

  <bb 2> [local count: 1073741824]:
  _9 = MEM <__int128 unsigned> [(char *)in_8(D)];
  _12 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(_9);
  _7 = VEC_PERM_EXPR <_12, _12, { 1, 0 }>;
  vect__18.1_25 = VIEW_CONVERT_EXPR<vector(2) long int>(_7);
  vect_carry_10.3_28 = .VCOND (vect__18.1_25, { 0, 0 }, { 135, 1 }, { 0, 0 }, 108);
  vect__5.0_13 = _12 << 1;
  vect__6.4_29 = vect__5.0_13 ^ vect_carry_10.3_28;
  MEM <vector(2) long unsigned int> [(char *)out_11(D)] = vect__6.4_29;
  return;

in the .optimized dump.
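
For reference, a scalar C sketch of what this kernel computes, reconstructed
from the GIMPLE above (identifiers are mine, not Botan's):

#include <stdint.h>
#include <string.h>

/* Doubling in GF(2^128) for the XTS tweak, little-endian layout.  */
void
poly_double_le2_scalar (unsigned char *out, const unsigned char *in)
{
  uint64_t lo, hi;
  memcpy (&lo, in, 8);       /* bits 0..63 of the 128-bit value */
  memcpy (&hi, in + 8, 8);   /* bits 64..127 */

  uint64_t carry = (hi >> 63) ? 0x87 : 0;    /* 0x87 == 135, the reduction constant */
  uint64_t new_hi = (hi << 1) ^ (lo >> 63);  /* carry the top bit of the low word */
  uint64_t new_lo = (lo << 1) ^ carry;

  memcpy (out, &new_lo, 8);
  memcpy (out + 8, &new_hi, 8);
}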

The latency of the data is at least 7 instructions that way, compared to
4 in the non-vectorized code (I guess I could try Intel IACA on it).

So if that's indeed the best we can do, then it's not profitable (btw,
with the above the vectorizer's conclusion is also that it is not profitable,
but due to excessive costing of the constants for the condition vectorization).

A simple asm replacement of the kernel results in

AES-128/XTS 292740 key schedule/sec; 0.00 ms/op 11571 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 765.571 MiB/sec 4.62 cycles/byte
(382.79 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 767.064 MiB/sec 4.61 cycles/byte
(382.79 MiB in 499.03 ms)

compared to

AES-128/XTS 283527 key schedule/sec; 0.00 ms/op 11932 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 768.446 MiB/sec 4.60 cycles/byte
(384.22 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 769.292 MiB/sec 4.60 cycles/byte
(384.22 MiB in 499.45 ms)

so that's indeed no improvement.  Bigger block sizes also contain vector
code, but that's not exercised by the botan speed measurement.
