Hi,

 A shortcoming of older versions of GAS prevents branch swapping if the instruction to be reordered into a branch delay slot immediately follows the delay slot of another branch. This happens to hit some MIPS16 call stubs, e.g. (from libgcc.a):

00000000 <__mips16_call_stub_sf_0>:
   0:   03e09021        move    s2,ra
   4:   0040f809        jalr    v0
   8:   0040c821        move    t9,v0
   c:   44020000        mfc1    v0,$f0
  10:   02400008        jr      s2
  14:   00000000        nop

 The shortcoming has recently been lifted, but I gather GCC generally wants to (and does) schedule delay slots manually elsewhere, so why not do so here as well.

 The piece of code above is generated from libgcc/config/mips/mips16.S with a macro called DELAYf(), meant for pieces that read from an FPR. There's a complementary macro called DELAYt(), for pieces that write an FPR, which does schedule the delay slot manually. The reason for such an arrangement is, I believe, the possibility that a read from CP1 may require another instruction to complete before the value read is available in the destination GPR (a coprocessor move delay slot).

 I believe the only legacy MIPS processors that implemented the MIPS16 ASE in its original variation (i.e. with no compact jumps, no SAVE/RESTORE, and no extend instructions) were LSI's TinyRISC cores. It's unclear to me from the TinyRISC documentation whether these cores suffered from the coprocessor move delay slot. They featured a short three-stage pipeline with a bypass that made data from memory loads available to the immediately following instruction if needed, in parallel with the destination register write-back, to avoid load delay slots. Unfortunately the documentation does not mention whether such a bypass was available for coprocessor moves, even though these instructions are said to have the very same pipeline stages as memory moves. It is therefore safe to assume coprocessor move delay slots were required.

 OTOH no modern MIPS architecture processor requires coprocessor move delay slots (they were already lifted with the MIPS IV legacy ISA), hence the current arrangement incurs unnecessary text space consumption and a performance hit for all the modern targets.
This is especially so as in many cases the next instruction executed after the branch delay slot will not access the GPR anyway, and thus will not cause any pipeline stall even with less efficient architecture implementations.

 This change therefore enables manual delay-slot scheduling of move-from-CP1 instructions whenever the stubs are built for the MIPS IV or a newer ISA. It makes the stub above look like this:

00000000 <__mips16_call_stub_sf_0>:
   0:   03e09021        move    s2,ra
   4:   0040f809        jalr    v0
   8:   0040c821        move    t9,v0
   c:   02400008        jr      s2
  10:   44020000        mfc1    v0,$f0

 These stubs are, I believe, not really covered in our testing, because they require a mixed standard-MIPS/MIPS16 environment. I have therefore verified by inspection that libgcc.a object code is still correct after this change, i.e. there is no change at all with current GAS (which otherwise schedules these move-from-CP1 instructions into the following jump's delay slot automatically) and the expected improved code with old GAS (which otherwise inserts a NOP into that delay slot instead).

 OK to apply?

2013-07-29  Maciej W. Rozycki  <ma...@codesourcery.com>

        libgcc/
        * config/mips/mips16.S (DELAYf): Alias to DELAYt for the MIPS IV
        ISA and up.

  Maciej

gcc-mips16-stub-delay-slot.patch
Index: gcc-fsf-trunk-quilt/libgcc/config/mips/mips16.S
===================================================================
--- gcc-fsf-trunk-quilt.orig/libgcc/config/mips/mips16.S        2013-03-27 15:20:54.000000000 +0000
+++ gcc-fsf-trunk-quilt/libgcc/config/mips/mips16.S     2013-07-13 02:40:38.300930313 +0100
@@ -89,8 +89,13 @@ see the files COPYING3 and COPYING.RUNTI
         OPCODE, OP2;                            \
         .set    reorder
 
+#if __mips >= 4
+/* Coprocessor moves are interlocked from the MIPS IV ISA up.  */
+#define DELAYf(T, OPCODE, OP2) DELAYt (T, OPCODE, OP2)
+#else
 /* Use "OPCODE. OP2" and jump to T.  */
 #define DELAYf(T, OPCODE, OP2) OPCODE, OP2; jr T
+#endif
 
 /* MOVE_SF_BYTE0(D)
      Move the first single-precision floating-point argument between