Hi Roger,

It was nice to meet you too.

Thank you for looking into the ARC's non-barrel-shifter configurations.  I will 
dive into your patch ASAP, but before starting, here are a few comments: 

-----Original Message-----
From: Roger Sayle <ro...@nextmovesoftware.com> 
Sent: Thursday, September 28, 2023 2:27 PM
To: gcc-patches@gcc.gnu.org
Cc: Claudiu Zissulescu <claz...@synopsys.com>
Subject: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.


Hi Claudiu,
It was great meeting up with you and the Synopsys ARC team at the GNU tools 
Cauldron in Cambridge.

This patch is the first in a series to improve SImode and DImode shifts and 
rotates in the ARC backend.  This first piece splits SImode shifts, for 
!TARGET_BARREL_SHIFTER targets, after combine and before reload, in the split1 
pass, as suggested by the FIXME comment above output_shift in arc.cc.  To do 
this I've copied the implementation of the x86_pre_reload_split function from 
i386 backend, and renamed it arc_pre_reload_split.

Although the actual implementations of shifts remain the same (as in 
output_shift), having them as explicit instructions in the RTL stream allows 
better scheduling and use of compact forms when available.  The benefits can be 
seen in two short examples below.

For the function:
unsigned int foo(unsigned int x, unsigned int y) {
  return y << 2;
}

GCC with -O2 -mcpu=em would previously generate:
foo:    add r1,r1,r1
        add r1,r1,r1
        j_s.d   [blink]
        mov_s   r0,r1   ;4

[CZI] Indeed, the move shouldn't be generated. The use of ADDs is slightly 
beneficial for older ARCv1 arches.

and with this patch now generates:
foo:    asl_s r0,r1
        j_s.d   [blink]
        asl_s r0,r0

[CZI] Nice. This new sequence is as fast as we can get for our ARCv2 cpus.

Notice that the original (from shift_si3's output_shift) requires the shift 
sequence to be monolithic, with the same destination register as the source 
(requiring an extra mov_s).  The new version can eliminate this move and 
schedule the second asl in the branch delay slot of the return.

For the function:
int x,y,z;

void bar()
{
  x <<= 3;
  y <<= 3;
  z <<= 3;
}

GCC -O2 -mcpu=em currently generates:
bar:    push_s  r13
        ld.as   r12,[gp,@x@sda] ;23
        ld.as   r3,[gp,@y@sda]  ;23
        mov r2,0
        add3 r12,r2,r12
        mov r2,0
        add3 r3,r2,r3
        ld.as   r2,[gp,@z@sda]  ;23
        st.as   r12,[gp,@x@sda] ;26
        mov r13,0
        add3 r2,r13,r2
        st.as   r3,[gp,@y@sda]  ;26
        st.as   r2,[gp,@z@sda]  ;26
        j_s.d   [blink]
        pop_s   r13

where each shift by 3 uses ARC's add3 instruction, which is similar to x86's 
lea implementing x = (y<<3) + z, but requires the value zero to be placed in a 
temporary register "z".  Splitting this before reload allows these pseudos to 
be shared/reused.  With this patch, we get

bar:    ld.as   r2,[gp,@x@sda]  ;23
        mov_s   r3,0    ;3
        add3    r2,r3,r2
        ld.as   r3,[gp,@y@sda]  ;23
        st.as   r2,[gp,@x@sda]  ;26
        ld.as   r2,[gp,@z@sda]  ;23
        mov_s   r12,0   ;3
        add3    r3,r12,r3
        add3    r2,r12,r2
        st.as   r3,[gp,@y@sda]  ;26
        st.as   r2,[gp,@z@sda]  ;26
        j_s     [blink]

[CZI] Looks great, but it also shows that I forgot to add the Ra,LIMM,RC 
variant to the ADD3 instruction, which would replace
        mov_s   r3,0    ;3
        add3    r2,r3,r2
with a single add3 r2,0,r2.  Indeed it is a longer instruction, but faster.

Unfortunately, register allocation means that we only share two of the three 
"mov_s z,0", but this is sufficient to reduce register pressure enough to avoid 
spilling r13 in the prologue/epilogue.

This patch also contains a (latent?) bug fix.  The implementation of the 
default insn "length" attribute assumes that instructions of type "shift" have 
two input operands and accesses operands[2]; hence specializations of shifts 
that don't have an operands[2] need to be categorized as type "unary" (which 
results in the correct length).

[CZI] The ARC types need an upgrade too.

This patch has been tested on a cross-compiler to arc-elf (hosted on 
x86_64-pc-linux-gnu), but because I have an incomplete toolchain, many of the 
regression tests fail; there are, however, no new failures with the new test 
cases added below.  If you can confirm that there are no issues from 
additional testing, is this OK for mainline?

Finally, a quick technical question.  ARC's zero-overhead loops require at 
least two instructions in the loop, so currently the backend's implementation 
of lshr20 pads the loop body with a "nop".

lshr20: mov.f lp_count, 20
        lpnz    2f
        lsr r0,r0
        nop
2:      # end single insn loop
        j_s     [blink]


[CZI] The ZOLs (LP instructions) are not great when dealing with short loop 
blocks; hence the NOP instruction. Personally, I don't fancy using the LP 
instruction in this case, as it prohibits LP usage for an actual for-loop.

could this be more efficiently implemented as:

lshr20: mov lp_count, 10
        lp 2f
        lsr_s r0,r0
        lsr_s r0,r0
2:      # end single insn loop
        j_s     [blink]

i.e. half the number of iterations, but doing twice as much useful work in 
each iteration?  Or might the nop be free on advanced microarchitectures, 
and/or might the consecutive dependent shifts cause a pipeline stall?  It 
would also be nice to fuse loops to implement rotations, such that rotr16 (aka 
swap) would look like:

rot16:  mov_s r1,r0
        mov lp_count, 16
        lp 2f
        asl_s r0,r0
        lsr_s r1,r1
2:      # end single insn loop
        j_s.d    [blink]
        or_s r0,r0,r1

Thanks in advance,
Roger


2023-09-28  Roger Sayle  <ro...@nextmovesoftware.com>

gcc/ChangeLog
        * config/arc/arc-protos.h (emit_shift): Delete prototype.
        (arc_pre_reload_split): New function prototype.
        * config/arc/arc.cc (emit_shift): Delete function.
        (arc_pre_reload_split): New predicate function, copied from i386,
        to schedule define_insn_and_split splitters to the split1 pass.
        * config/arc/arc.md (ashlsi3): Expand RTL template unconditionally.
        (ashrsi3): Likewise.
        (lshrsi3): Likewise.
        (shift_si3): Move after other shift patterns, and disable when
        operands[2] is one (which is handled by its own define_insn).
        Use shiftr4_operator, instead of shift4_operator, as this is no
        longer used for left shifts.
        (shift_si3_loop): Likewise.  Additionally remove match_scratch.
        (*ashlsi3_nobs): New pre-reload define_insn_and_split.
        (*ashrsi3_nobs): Likewise.
        (*lshrsi3_nobs): Likewise.
        (rotrsi3_cnt1): Rename define_insn from *rotrsi3_cnt1.
        (ashlsi3_cnt1): Rename define_insn from *ashlsi2_cnt1.  Change
        insn type attribute to "unary", as this doesn't have operands[2].
        Change length attribute to "*,4" to allow compact representation.
        (lshrsi3_cnt1): Rename define_insn from *lshrsi3_cnt1.  Change
        insn type attribute to "unary", as this doesn't have operands[2].
        (ashrsi3_cnt1): Rename define_insn from *ashrsi3_cnt1.  Change
        insn type attribute to "unary", as this doesn't have operands[2].
        (add_shift): Rename define_insn from *add_shift.
        * config/arc/predicates.md (shiftl4_operator): Delete.
        (shift4_operator): Delete.

gcc/testsuite/ChangeLog
        * gcc.target/arc/ashrsi-1.c: New TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/ashrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/ashrsi-3.c: Likewise.
        * gcc.target/arc/ashrsi-4.c: Likewise.
        * gcc.target/arc/ashrsi-5.c: Likewise.
        * gcc.target/arc/lshrsi-1.c: New TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/lshrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/lshrsi-3.c: Likewise.
        * gcc.target/arc/lshrsi-4.c: Likewise.
        * gcc.target/arc/lshrsi-5.c: Likewise.
        * gcc.target/arc/shlsi-1.c: New TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/shlsi-2.c: New !TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/shlsi-3.c: Likewise.
        * gcc.target/arc/shlsi-4.c: Likewise.
        * gcc.target/arc/shlsi-5.c: Likewise.
