Hi Uros,

I believe the proposed sequences should be dramatically faster than
LLVM's implementation(s), because of the large latencies required to
move values between the vector and scalar units on modern x86_64
microarchitectures.  All of the SSE2 instructions used in the sequences
proposed by my patch have single-cycle latencies, so each sequence has a
maximum total latency of 5 cycles, and thanks to multiple issue it
typically completes in between 1 and 3 cycles, depending upon the
sequence.
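To make the comparison concrete, the functions under discussion keep a
128-bit value in an SSE register and shift/rotate it by a constant.  A
minimal source-level sketch (the uv1ti typedef and function name are
just my shorthand here, not lifted verbatim from the patch or its
testcase) is:

typedef unsigned __int128 uv1ti __attribute__ ((__vector_size__ (16)));

/* Shift a one-element vector of __int128 left by 8 bits; with this
   patch the function body should compile to a single
   pslldq $1, %xmm0 (plus the ret).  */
uv1ti
ashl_8 (uv1ti x)
{
  return x << 8;
}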
Moving between units is significantly slower; according to Agner Fog's
tables, the pinsrq/pextrq instructions you suggest have latencies of up
to 7 cycles on the Silvermont architecture.  Let's take the LLVM code
you've provided and annotate it with cycle counts for a recent Intel
CPU (Cascade Lake) and a recent AMD CPU (Zen 2):

        movq    %xmm0, %rax             ; 2-3 cycles
        pshufd  $78, %xmm0, %xmm0       ; 1 cycle
        movq    %xmm0, %rcx             ; 2-3 cycles
        shldq   $8, %rax, %rcx          ; 3 cycles
        shlq    $8, %rax                ; 1 cycle
        movq    %rcx, %xmm1             ; 2-3 cycles
        movq    %rax, %xmm0             ; 2-3 cycles
        punpcklqdq %xmm1, %xmm0         ; 1 cycle

This 8-instruction sequence has a total latency of 14 cycles on Cascade
Lake and 18 cycles on Zen 2, but a scheduled cycle count of 9 and 11
cycles, respectively.

The same left shift by 8, as implemented by the proposed patch, is:

        pslldq  $1, %xmm0               ; 1 cycle

And for reference, the code currently generated by GCC is:

        movaps  %xmm0, -24(%rsp)        ; 3 cycles
        movq    -24(%rsp), %rax         ; 2 cycles
        movq    -16(%rsp), %rdx         ; 2 cycles
        shldq   $8, %rax, %rdx          ; 3 cycles
        salq    $8, %rax                ; 1 cycle
        movq    %rax, -24(%rsp)         ; 2 cycles
        movq    %rdx, -16(%rsp)         ; 2 cycles
        movdqa  -24(%rsp), %xmm0        ; 2 cycles

The very worst case timing from my patch is the five-instruction rotate:

        pshufd  $78, %xmm0, %xmm1       ; 1 cycle
        pshufd  $57, %xmm0, %xmm0       ; 1 cycle
        pslld   $1, %xmm1               ; 1 cycle
        psrld   $31, %xmm0              ; 1 cycle
        por     %xmm1, %xmm0            ; 1 cycle

which has a total latency of 5 cycles, but can complete in 3 cycles when
suitably scheduled: the two pshufds can execute concurrently, as can the
two shifts, followed finally by the por.

Perhaps I'm missing something, but I'd expect this patch to be three or
four times faster, on recent hardware, than the code generated by LLVM.
Let me know if you'd like me to run microbenchmarks (a sketch of the
harness I have in mind is at the end of this message), but the
documented timings are such a dramatic improvement that I'm a little
surprised you've asked about performance.

My patch is also a code size win with -Os: ashl_8 is currently 39 bytes,
and shrinks to 5 bytes with this patch.

Please let me know what you think.

Roger
--

-----Original Message-----
From: Uros Bizjak <ubiz...@gmail.com>
Sent: 25 October 2021 09:02
To: Roger Sayle <ro...@nextmovesoftware.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] x86_64: Implement V1TI mode shifts/rotates by a constant

On Sun, Oct 24, 2021 at 6:34 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> This patch provides RTL expanders to implement logical shifts and
> rotates of 128-bit values (stored in vector integer registers) by
> constant bit counts.  Previously, GCC would transfer these values to a
> pair of scalar registers (TImode) via memory to perform the operation,
> then transfer the result back via memory.  Instead these operations
> are now expanded using (between 1 and 5) SSE2 vector instructions.

Hm, instead of using memory (without STL forwarding for general ->
XMM moves!)
these should use something similar to what clang produces (or use
pextrq/pinsrq, at least with SSE4.1):

        movq    %xmm0, %rax
        pshufd  $78, %xmm0, %xmm0
        movq    %xmm0, %rcx
        shldq   $8, %rax, %rcx
        shlq    $8, %rax
        movq    %rcx, %xmm1
        movq    %rax, %xmm0
        punpcklqdq %xmm1, %xmm0

> Logical shifts by multiples of 8 can be implemented using x86_64's
> pslldq/psrldq instruction:
> ashl_8:
>         pslldq  $1, %xmm0
>         ret
> lshr_32:
>         psrldq  $4, %xmm0
>         ret
>
> Logical shifts by greater than 64 can use pslldq/psrldq $8, followed
> by a psllq/psrlq for the remaining bits:
> ashl_111:
>         pslldq  $8, %xmm0
>         psllq   $47, %xmm0
>         ret
> lshr_127:
>         psrldq  $8, %xmm0
>         psrlq   $63, %xmm0
>         ret
>
> The remaining logical shifts make use of the following idiom:
> ashl_1:
>         movdqa  %xmm0, %xmm1
>         psllq   $1, %xmm0
>         pslldq  $8, %xmm1
>         psrlq   $63, %xmm1
>         por     %xmm1, %xmm0
>         ret
> lshr_15:
>         movdqa  %xmm0, %xmm1
>         psrlq   $15, %xmm0
>         psrldq  $8, %xmm1
>         psllq   $49, %xmm1
>         por     %xmm1, %xmm0
>         ret
>
> Rotates by multiples of 32 can use x86_64's pshufd:
> rotr_32:
>         pshufd  $57, %xmm0, %xmm0
>         ret
> rotr_64:
>         pshufd  $78, %xmm0, %xmm0
>         ret
> rotr_96:
>         pshufd  $147, %xmm0, %xmm0
>         ret
>
> Rotates by multiples of 8 (other than multiples of 32) can make use of
> both pslldq and psrldq, followed by por:
> rotr_8:
>         movdqa  %xmm0, %xmm1
>         psrldq  $1, %xmm0
>         pslldq  $15, %xmm1
>         por     %xmm1, %xmm0
>         ret
> rotr_112:
>         movdqa  %xmm0, %xmm1
>         psrldq  $14, %xmm0
>         pslldq  $2, %xmm1
>         por     %xmm1, %xmm0
>         ret
>
> And the remaining rotates use one or two pshufd, followed by a
> psrld/pslld/por sequence:
> rotr_1:
>         movdqa  %xmm0, %xmm1
>         pshufd  $57, %xmm0, %xmm0
>         psrld   $1, %xmm1
>         pslld   $31, %xmm0
>         por     %xmm1, %xmm0
>         ret
> rotr_63:
>         pshufd  $78, %xmm0, %xmm1
>         pshufd  $57, %xmm0, %xmm0
>         pslld   $1, %xmm1
>         psrld   $31, %xmm0
>         por     %xmm1, %xmm0
>         ret
> rotr_111:
>         pshufd  $147, %xmm0, %xmm1
>         pslld   $17, %xmm0
>         psrld   $15, %xmm1
>         por     %xmm1, %xmm0
>         ret
>
> The new test case, sse2-v1ti-shift.c, is a run-time check to confirm
> that the results of V1TImode shifts/rotates by constants exactly match
> the expected results of TImode operations, for various input test
> vectors.

Is the sequence of 4+ SSE instructions really faster than pinsrq/pextrq
(and two movq insn) + two operations on integer registers?

Uros.

> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check with no new failures.  Ok for mainline?
>
>
> 2021-10-24  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386-expand.c (ix86_expand_v1ti_shift): New helper
>         function to expand V1TI mode logical shifts by integer constants.
>         (ix86_expand_v1ti_rotate): New helper function to expand V1TI
>         mode rotations by integer constants.
>         * config/i386/i386-protos.h (ix86_expand_v1ti_shift,
>         ix86_expand_v1ti_rotate): Prototype new functions here.
>         * config/i386/sse.md (ashlv1ti3, lshrv1ti3, rotlv1ti3, rotrv1ti3):
>         New TARGET_SSE2 expanders to implement V1TI shifts and rotations.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/sse2-v1ti-shift.c: New test case.
>
>
> Thanks in advance,
> Roger
> --
>
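P.S.  In case it's useful, the microbenchmark harness I have in mind is
roughly the following.  It is only a rough sketch: the typedef, loop
count and rdtsc-based timing are placeholders of mine (not part of the
patch or its testsuite), and rdtsc is at best a crude proxy for
scheduled latency.

#include <stdio.h>
#include <x86intrin.h>

typedef unsigned __int128 uv1ti __attribute__ ((__vector_size__ (16)));

int
main (void)
{
  uv1ti v = { ((unsigned __int128) 0x0123456789abcdefULL << 64)
              | 0xfedcba9876543210ULL };
  unsigned long long start = __rdtsc ();
  for (long i = 0; i < 100000000; i++)
    {
      /* Serialize on v so we measure the latency of the dependent
         chain of shifts; the empty asm keeps v live in an XMM
         register and stops the loop from being folded away.  */
      v = v << 8;
      __asm__ volatile ("" : "+x" (v));
    }
  unsigned long long stop = __rdtsc ();
  printf ("%llu reference cycles, residue %d\n",
          stop - start, (int) (v[0] & 0xff));
  return 0;
}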