Hi Fredrik,

> I'm glad to hear from you again!

 I'm not dead, just distracted.

> >  I think that should be a GAS warning really (similarly to macros that 
> > expand to multiple instructions in a delay slot) as people ought to be 
> > allowed to do what they wish, and then `-Werror' can be used for code 
> > quality enforcement (and possibly disabled on a case-by-case basis).
> 
> How can one reasonably prevent the bug when compiling a whole Linux
> distribution with thousands of packages if GAS merely warns and proceeds
> in many cases?

 I think the tools must not set a policy.  By using `.set noreorder' the 
user told the toolchain he knows better and asked to keep its hands away.

 As I say you can use `-Werror' for code auditing.  And in most cases 
manually scheduled delay slots in handcoded assembly are a sign of a 
problem with the piece of code affected, regardless of the R5900 erratum.

> > > [ In theory, GAS could actually insert NOPs prior to the noreorder 
> > > directive,
> > > to make the loop longer that six instructions, but GAS does not have that
> > > kind of capability. Another option that GCC prevents is to place a NOP in
> > > the delay slot. ]
> > 
> >  Well, GAS does have that capability, although of course it is not enabled 
> > for `noreorder' code.
> 
> Hmm? I'm unable to make sense of that, since such NOPs are unaffected by
> noreorder. This is what I meant:
> 
> loop: addiu   $5,$5,1
>       addiu   $4,$4,1
>       lb      $2,-1($5)
>       nop                     /* These two NOPs would prevent the */
>       nop                     /* bug while preserving noreorder. */
>       .set    noreorder
>       .set    nomacro
>       bne     $2,$3,loop
>       sb      $2,-1($4)
>       .set    macro
>       .set    reorder

 See `nops_for_insn'.  However again, as I noted, `.set noreorder' tells 
GAS not to analyse the dependencies for code within.  There is no need to 
schedule this delay slot manually, and if this is generated code, then the 
producer (GCC) should have taken care of the hazards, be it architectural 
or errata.  If this is manually written code, then the author asked for 
trouble.

> > For generated code I think however that usually it 
> > will be cheaper performance-wise if a non-trivial delay-slot instruction 
> > is never produced in the affected cases (i.e. a dummy NOP is always used).
> 
> I suppose so too. One observation is that R5900 BogoMIPS went down from
> 584 to 195 when switching from the generic kernel variant
> 
>       .set    noreorder
> 1:    bnez    a0,1b
>       addiu   a0,a0,-1
>       .set    reorder
> 
> that is undefined for R5900 in arch/mips/lib/delay.c, to a special variant
> 
>       beqz    a0,2f
> 1:    addiu   a0,a0,-1
>       bnez    a0,1b
> 2:
> 
> for the R5900 where GAS will place a NOP in the branch delay slot. With
> loop unrolling BogoMIPS could be inflated again, of course. In general,
> unrolling is a bit better for the R5900 than general MIPS. Perhaps GCC
> could be informed about that, possibly via profile-guided optimisation.

 Well, BogoMIPS is just an arbitrary number.

 There is a data dependency here, so manual delay slot scheduling was 
unavoidable.  I'd suggest using this as a functional equivalent for the 
R5900:

        addiu   a0,a0,1
1:      addiu   a0,a0,-1
        bnez    a0,1b

and avoiding the irregularity for a0==0.

> > > A reasonable fix for GCC is perhaps to update gcc/config/mips/mips.md to 
> > > not
> > > make explicit use of the branch delay slot, as suggested by the patch 
> > > below?
> > > Then GCC will emit
> > > 
> > > loop:     addiu   $5,$5,1
> > >   addiu   $4,$4,1
> > >   lb      $2,-1($5)
> > >   sb      $2,-1($4)
> > >   bne     $2,$3,loop
> > > 
> > > that GAS will adjust in the ELF object to
> > > 
> > >    4:     24a50001        addiu   a1,a1,1
> > >    8:     24840001        addiu   a0,a0,1
> > >    c:     80a2ffff        lb      v0,-1(a1)
> > >   10:     a082ffff        sb      v0,-1(a0)
> > >   14:     1443fffb        bne     v0,v1,4
> > >   18:     00000000        nop
> > > 
> > > where a NOP is placed in the delay slot to avoid the bug. For longer 
> > > loops,
> > > this relies on GAS making appropriate use of the delay slot. I'm not sure 
> > > if
> > > GAS is good at that, though?
> > 
> >  I'm sort-of surprised that GCC has produced `reorder' code here, making 
> > it rely on GAS for delay slot scheduling.  Have you used an unusual set of 
> > options that prevents GCC from making `noreorder' code, which as I recall 
> > is the usual default (not for the MIPS16 mode IIRC, as well as some 
> > obscure corner cases)?
> 
> I used the suggested patch, and recompiled the kernel with it, and verified
> with the machine code tool that there were no undefined loops. Wouldn't that
> be enough?

 It would break if GAS was invoked without `-mfix-r5900' (e.g. by using an 
explicit `-Wa,-mno-fix-r5900' option).

> > > diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md
> > > index e17b1d522f0..acd31a8960c 100644
> > > --- a/gcc/config/mips/mips.md
> > > +++ b/gcc/config/mips/mips.md
> > > @@ -1091,6 +1091,7 @@
> > >  
> > >  (define_delay (and (eq_attr "type" "branch")
> > >              (not (match_test "TARGET_MIPS16"))
> > > +            (not (match_test "TARGET_FIX_R5900"))
> > >              (eq_attr "branch_likely" "yes"))
> > >    [(eq_attr "can_delay" "yes")
> > >     (nil)
> > > @@ -1100,6 +1101,7 @@
> > >  ;; not annul on false.
> > >  (define_delay (and (eq_attr "type" "branch")
> > >              (not (match_test "TARGET_MIPS16"))
> > > +            (not (match_test "TARGET_FIX_R5900"))
> > >              (ior (match_test "TARGET_CB_NEVER")
> > >                   (and (eq_attr "compact_form" "maybe")
> > >                        (not (match_test "TARGET_CB_ALWAYS")))
> > > 
> > 
> >  I think you need to modify the default `can_delay' attribute definition 
> > instead (in the same way).
> 
> I tried to limit the case to branch delays only, which is a rough
> approximation. Call and jump delay slots are acceptable. Are you referring
> to this piece?
> 
> ;; Can the instruction be put into a delay slot?
> (define_attr "can_delay" "no,yes"
>   (if_then_else (and (eq_attr "type" "!branch,call,jump")
>                    (eq_attr "hazard" "none")
>                    (match_test "get_attr_insn_count (insn) == 1"))
>               (const_string "yes")
>               (const_string "no")))

 Yes.  My suggestion does not preclude limiting the workaround to branches 
only while being precise about what the situation is, i.e. branches still 
require a delay slot instruction (unlike MIPS16 or uMIPS compact branches) 
but no real instruction is suitable.  I'd prefer if GCC actually scheduled 
the required dummy NOP instruction explicitly.

> > An improved future version might determine the 
> > exact conditions as noted in your proposed commit description, however I'd 
> > suggest making this simple change first.
> 
> Learning the exact conditions, in a hardware sense, would be interesting
> indeed, as some short loops actually do appear to work despite being labeled
> as undefined. A fairly deep understanding of the R5900 implementation is
> essential for such an undertaking. :)

 Ack, although in this case I only meant what was stated in the commit 
description, i.e. avoiding forcing a dummy delay slot instruction where 
known for sure not to be needed, e.g. a long loop, a forward branch, etc.

  Maciej

Reply via email to