Hi Uros,
I believe the proposed sequences should be dramatically faster than LLVM's
implementation(s), due to the large latencies required to move values between
the vector and scalar parts on modern x86_64 microarchitectures.  All of the
SSE2 instructions used in the sequences proposed by my patch have single-cycle
latencies, so each sequence has a maximum total latency of 5 cycles, and thanks
to multiple issue typically completes in between 1 and 3 cycles, depending upon
the sequence.
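
For reference, the operations in question are constant shifts and rotates of
128-bit values held in SSE registers.  A minimal sketch of the kind of source
involved (the typedef name is illustrative; this is not the actual testcase):

typedef unsigned __int128 v1ti __attribute__ ((__vector_size__ (16)));

v1ti ashl_8 (v1ti x)  { return x << 8;  }  /* 128-bit left shift by 8 bits   */
v1ti lshr_32 (v1ti x) { return x >> 32; }  /* 128-bit right shift by 32 bits */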

Moving between units is significantly slower; according to Agner Fog's tables,
the pinsrq/pextrq instructions you suggest have latencies up to 7 cycles on the
Silvermont architecture.  Let's take the LLVM code you've provided and
annotate it with cycle counts for a recent Intel CPU (Cascade Lake) and a
recent AMD CPU (Zen 2).

movq    %xmm0, %rax             ; 2-3 cycles
pshufd  $78, %xmm0, %xmm0       ; 1 cycle
movq    %xmm0, %rcx             ; 2-3 cycles
shldq   $8, %rax, %rcx          ; 3 cycles
shlq    $8, %rax                ; 1 cycle
movq    %rcx, %xmm1             ; 2-3 cycles
movq    %rax, %xmm0             ; 2-3 cycles
punpcklqdq      %xmm1, %xmm0    ; 1 cycle

This eight-instruction sequence has a total latency of 14 cycles on Cascade Lake
and 18 cycles on Zen 2 (summing the per-instruction latencies above, taking 2
cycles per inter-unit movq on Cascade Lake and 3 on Zen 2), but a scheduled
cycle count of 9 and 11 cycles respectively.

The same left shift by 8 bits, as implemented by the proposed patch, is the
single instruction:

pslldq  $1, %xmm0               ; 1 cycle
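
Or, expressed with SSE2 intrinsics (purely illustrative; the patch itself
emits this via an RTL expander, not intrinsics):

#include <emmintrin.h>

/* A left shift of a 128-bit value by 8 bits is a left shift by one byte,
   which SSE2 performs directly with a single pslldq.  */
__m128i ashl_8_intrin (__m128i x)
{
  return _mm_slli_si128 (x, 1);    /* pslldq $1, %xmm0 */
}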

And for reference, the code currently generated by GCC is:

movaps  %xmm0, -24(%rsp)        ; 3 cycles
movq    -24(%rsp), %rax         ; 2 cycles
movq    -16(%rsp), %rdx         ; 2 cycles
shldq   $8, %rax, %rdx          ; 3 cycles
salq    $8, %rax                ; 1 cycle
movq    %rax, -24(%rsp)         ; 2 cycles
movq    %rdx, -16(%rsp)         ; 2 cycles
movdqa  -24(%rsp), %xmm0        ; 2 cycles


The very worst-case timing in my patch is the five-instruction rotate:
pshufd  $78, %xmm0, %xmm1       ; 1 cycle
pshufd  $57, %xmm0, %xmm0       ; 1 cycle
pslld   $1, %xmm1               ; 1 cycle
psrld   $31, %xmm0              ; 1 cycle
por     %xmm1, %xmm0            ; 1 cycle

which has a total latency of 5 cycles, but can complete in 3 cycles when
suitably scheduled: the two pshufds can execute concurrently, as can the two
shifts, followed finally by the por.
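
The dependency structure is perhaps easier to see in an intrinsics sketch of
that five-instruction rotate (illustrative only; the function name is mine,
and the patch emits this directly from the expander):

#include <emmintrin.h>

__m128i rotr_63_sketch (__m128i x)
{
  __m128i hi = _mm_shuffle_epi32 (x, 0x4e);  /* pshufd $78: swap 64-bit halves   */
  __m128i lo = _mm_shuffle_epi32 (x, 0x39);  /* pshufd $57: rotate dwords by one */
  hi = _mm_slli_epi32 (hi, 1);               /* pslld $1: independent of lo      */
  lo = _mm_srli_epi32 (lo, 31);              /* psrld $31: independent of hi     */
  return _mm_or_si128 (hi, lo);              /* por: joins the two chains        */
}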

Perhaps I'm missing something, but I'd expect this patch to be three or
four times faster, on recent hardware, than the code generated by LLVM.

Let me know if you'd like me to run microbenchmarks, but the documented
timings are such a dramatic improvement that I'm a little surprised you've
asked about performance.  My patch is also a code-size win with -Os
(ashl_8 is currently 39 bytes, and shrinks to 5 bytes with this patch).


Please let me know what you think.
Roger
--

-----Original Message-----
From: Uros Bizjak <ubiz...@gmail.com> 
Sent: 25 October 2021 09:02
To: Roger Sayle <ro...@nextmovesoftware.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] x86_64: Implement V1TI mode shifts/rotates by a constant

On Sun, Oct 24, 2021 at 6:34 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> This patch provides RTL expanders to implement logical shifts and 
> rotates of 128-bit values (stored in vector integer registers) by 
> constant bit counts.  Previously, GCC would transfer these values to a 
> pair of scalar registers (TImode) via memory to perform the operation, 
> then transfer the result back via memory.  Instead these operations 
> are now expanded using (between 1 and 5) SSE2 vector instructions.

Hm, instead of using memory (without STL forwarding for general -> XMM
moves!) these should use something similar to what clang produces (or use 
pextrq/pinsrq, at least with SSE4.1):

       movq    %xmm0, %rax
       pshufd  $78, %xmm0, %xmm0
       movq    %xmm0, %rcx
       shldq   $8, %rax, %rcx
       shlq    $8, %rax
       movq    %rcx, %xmm1
       movq    %rax, %xmm0
       punpcklqdq      %xmm1, %xmm0

> Logical shifts by multiples of 8 can be implemented using x86_64's 
> pslldq/psrldq instruction:
> ashl_8: pslldq  $1, %xmm0
>         ret
> lshr_32:
>         psrldq  $4, %xmm0
>         ret
>
> Logical shifts by greater than 64 can use pslldq/psrldq $8, followed 
> by a psllq/psrlq for the remaining bits:
> ashl_111:
>         pslldq  $8, %xmm0
>         psllq   $47, %xmm0
>         ret
> lshr_127:
>         psrldq  $8, %xmm0
>         psrlq   $63, %xmm0
>         ret
>
> The remaining logical shifts make use of the following idiom:
> ashl_1:
>         movdqa  %xmm0, %xmm1
>         psllq   $1, %xmm0
>         pslldq  $8, %xmm1
>         psrlq   $63, %xmm1
>         por     %xmm1, %xmm0
>         ret
> lshr_15:
>         movdqa  %xmm0, %xmm1
>         psrlq   $15, %xmm0
>         psrldq  $8, %xmm1
>         psllq   $49, %xmm1
>         por     %xmm1, %xmm0
>         ret
>
> Rotates by multiples of 32 can use x86_64's pshufd:
> rotr_32:
>         pshufd  $57, %xmm0, %xmm0
>         ret
> rotr_64:
>         pshufd  $78, %xmm0, %xmm0
>         ret
> rotr_96:
>         pshufd  $147, %xmm0, %xmm0
>         ret
>
> Rotates by multiples of 8 (other than multiples of 32) can make use of 
> both pslldq and psrldq, followed by por:
> rotr_8:
>         movdqa  %xmm0, %xmm1
>         psrldq  $1, %xmm0
>         pslldq  $15, %xmm1
>         por     %xmm1, %xmm0
>         ret
> rotr_112:
>         movdqa  %xmm0, %xmm1
>         psrldq  $14, %xmm0
>         pslldq  $2, %xmm1
>         por     %xmm1, %xmm0
>         ret
>
> And the remaining rotates use one or two pshufd, followed by a 
> psrld/pslld/por sequence:
> rotr_1:
>         movdqa  %xmm0, %xmm1
>         pshufd  $57, %xmm0, %xmm0
>         psrld   $1, %xmm1
>         pslld   $31, %xmm0
>         por     %xmm1, %xmm0
>         ret
> rotr_63:
>         pshufd  $78, %xmm0, %xmm1
>         pshufd  $57, %xmm0, %xmm0
>         pslld   $1, %xmm1
>         psrld   $31, %xmm0
>         por     %xmm1, %xmm0
>         ret
> rotr_111:
>         pshufd  $147, %xmm0, %xmm1
>         pslld   $17, %xmm0
>         psrld   $15, %xmm1
>         por     %xmm1, %xmm0
>         ret
>
> The new test case, sse2-v1ti-shift.c, is a run-time check to confirm 
> that the results of V1TImode shifts/rotates by constants, exactly 
> match the expected results of TImode operations, for various input test 
> vectors.

Is the sequence of 4+ SSE instructions really faster than pinsrq/pextrq (and 
two movq insn) + two operations on integer registers?

Uros.

> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap 
> and make -k check with no new failures.  Ok for mainline?
>
>
> 2021-10-24  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386-expand.c (ix86_expand_v1ti_shift): New helper
>         function to expand V1TI mode logical shifts by integer constants.
>         (ix86_expand_v1ti_rotate): New helper function to expand V1TI
>         mode rotations by integer constants.
>         * config/i386/i386-protos.h (ix86_expand_v1ti_shift,
>         ix86_expand_v1ti_rotate): Prototype new functions here.
>         * config/i386/sse.md (ashlv1ti3, lshrv1ti3, rotlv1ti3, rotrv1ti3):
>         New TARGET_SSE2 expanders to implement V1TI shifts and rotations.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/sse2-v1ti-shift.c: New test case.
>
>
> Thanks in advance,
> Roger
> --
>
