Hi,

 A shortcoming of older versions of GAS prevents branch swapping from 
happening if the instruction to be reordered into a branch delay slot 
immediately follows the delay slot of another branch.  This happens to 
hit some MIPS16 call stubs, e.g. (from libgcc.a):

00000000 <__mips16_call_stub_sf_0>:
   0:   03e09021        move    s2,ra
   4:   0040f809        jalr    v0
   8:   0040c821        move    t9,v0
   c:   44020000        mfc1    v0,$f0
  10:   02400008        jr      s2
  14:   00000000        nop

The shortcoming has been recently lifted, but I gather GCC generally wants 
to (and does) schedule delay slots manually elsewhere, so why not do so 
here as well?

 The piece of code above is generated from libgcc/config/mips/mips16.S 
with a macro called DELAYf() meant for pieces that read from an FPR.  
There's a complementary macro called DELAYt(), for pieces that write an 
FPR, that does schedule the delay slot manually.  The reason for this 
arrangement is, I believe, that a read from CP1 may require another 
instruction to complete before the value read is available in the 
destination GPR (a coprocessor move delay slot).

 I believe the only legacy MIPS processors that implemented the MIPS16 ASE 
in its original variation (i.e. with no compact jumps, no SAVE/RESTORE, 
and no extend instructions) were LSI's TinyRISC cores.  It's unclear to 
me from the TinyRISC documentation whether these cores suffered from the 
coprocessor move delay slot.  They featured a short three-stage pipeline 
that had a bypass implemented to make data from memory loads available to 
the immediately following instruction if needed, in parallel to the 
destination register write back, to avoid load delay slots.  
Unfortunately the documentation does not mention whether such a bypass 
was available for coprocessor moves, even though these instructions are 
said to have the very same pipeline stages as memory moves.  It is 
therefore safest to assume coprocessor move delay slots were required.

 OTOH no modern MIPS architecture processor requires coprocessor move 
delay slots (they were already lifted with the legacy MIPS IV ISA), 
hence the current arrangement incurs unnecessary text space consumption 
and a performance hit for all the modern targets, especially as in many 
cases the next instruction executed after the branch delay slot will not 
access the GPR anyway and thus would not cause any pipeline stall even 
with less efficient architecture implementations.

 This change therefore enables manual delay-slot scheduling of 
move-from-CP1 instructions whenever the stubs are built for the MIPS IV or 
a newer ISA. It makes the stub above look like this:

00000000 <__mips16_call_stub_sf_0>:
   0:   03e09021        move    s2,ra
   4:   0040f809        jalr    v0
   8:   0040c821        move    t9,v0
   c:   02400008        jr      s2
  10:   44020000        mfc1    v0,$f0
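
 For illustration only, the macro selection can be mimicked on any host 
with the small C sketch below.  ISA_LEVEL is a hypothetical stand-in for 
the compiler-provided __mips macro, the body of DELAYt() is my assumption 
of its shape rather than a copy from mips16.S, and the helper functions 
are made up for the example.  Register names are spelled without `$' so 
that the text preprocesses portably.

```c
#include <string.h>

/* ISA_LEVEL is a hypothetical stand-in for the compiler-provided __mips
   macro, so that this sketch builds on any host.  */
#ifndef ISA_LEVEL
#define ISA_LEVEL 4
#endif

/* Turn the fully expanded macro argument into a string literal.  */
#define STR(...) #__VA_ARGS__
#define XSTR(...) STR (__VA_ARGS__)

/* Assumed shape of DELAYt: jump with the delay slot filled by hand.  */
#define DELAYt(T, OPCODE, OP2) .set noreorder; jr T; OPCODE, OP2; .set reorder

#if ISA_LEVEL >= 4
/* Coprocessor moves are interlocked, so reuse DELAYt.  */
#define DELAYf(T, OPCODE, OP2) DELAYt (T, OPCODE, OP2)
#else
/* Leave the move ahead of the jump; the assembler fills the slot.  */
#define DELAYf(T, OPCODE, OP2) OPCODE, OP2; jr T
#endif

/* Return the text DELAYf expands to for a move-from-CP1 stub tail.  */
const char *
delayf_expansion (void)
{
  return XSTR (DELAYf (s2, mfc1 v0, f0));
}

/* Return 1 if the jump precedes the move in the expansion, else 0.  */
int
jump_comes_first (void)
{
  const char *text = delayf_expansion ();
  const char *jump = strstr (text, "jr");
  const char *move = strstr (text, "mfc1");

  return jump != NULL && move != NULL && jump < move;
}
```

 Building the sketch with -DISA_LEVEL=3 selects the other branch, where 
the move stays ahead of the jump and the delay slot is left to the 
assembler.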

 These stubs are, I believe, not really covered by our testing, because 
they require a mixed standard-MIPS/MIPS16 environment.  I have therefore 
verified libgcc.a object code by inspection to be still correct after this 
change, i.e. no change at all with current GAS (that otherwise schedules 
these move-from-CP1 instructions into the following jump's delay slot 
automatically) and the expected improved code with old GAS (that otherwise 
inserts a NOP into that delay slot instead).

 OK to apply?

2013-07-29  Maciej W. Rozycki  <ma...@codesourcery.com>

        libgcc/
        * config/mips/mips16.S (DELAYf): Alias to DELAYt for the MIPS IV 
        ISA and up.

  Maciej

gcc-mips16-stub-delay-slot.patch
Index: gcc-fsf-trunk-quilt/libgcc/config/mips/mips16.S
===================================================================
--- gcc-fsf-trunk-quilt.orig/libgcc/config/mips/mips16.S        2013-03-27 15:20:54.000000000 +0000
+++ gcc-fsf-trunk-quilt/libgcc/config/mips/mips16.S     2013-07-13 02:40:38.300930313 +0100
@@ -89,8 +89,13 @@ see the files COPYING3 and COPYING.RUNTI
        OPCODE, OP2;                            \
        .set    reorder
 
+#if __mips >= 4
+/* Coprocessor moves are interlocked from the MIPS IV ISA up.  */
+#define DELAYf(T, OPCODE, OP2) DELAYt (T, OPCODE, OP2)
+#else
 /* Use "OPCODE. OP2" and jump to T.  */
 #define DELAYf(T, OPCODE, OP2) OPCODE, OP2; jr T
+#endif
 
 /* MOVE_SF_BYTE0(D)
        Move the first single-precision floating-point argument between
