Hi Maciej,
I'm glad to hear from you again!
> I think that should be a GAS warning really (similarly to macros that
> expand to multiple instructions in a delay slot) as people ought to be
> allowed to do what they wish, and then `-Werror' can be used for code
> quality enforcement (and possibly disabled on a case-by-case basis).
How can one reasonably prevent the bug when compiling a whole Linux
distribution with thousands of packages if GAS merely warns and proceeds
in many cases?
> > [ In theory, GAS could actually insert NOPs prior to the noreorder
> > directive,
> > to make the loop longer that six instructions, but GAS does not have that
> > kind of capability. Another option that GCC prevents is to place a NOP in
> > the delay slot. ]
>
> Well, GAS does have that capability, although of course it is not enabled
> for `noreorder' code.
Hmm? I'm unable to make sense of that, since such NOPs are unaffected by
noreorder. This is what I meant:
loop: addiu $5,$5,1
addiu $4,$4,1
lb $2,-1($5)
nop /* These two NOPs would prevent the */
nop /* bug while preserving noreorder. */
.set noreorder
.set nomacro
bne $2,$3,loop
sb $2,-1($4)
.set macro
.set reorder
> For generated code I think however that usually it
> will be cheaper performance-wise if a non-trivial delay-slot instruction
> is never produced in the affected cases (i.e. a dummy NOP is always used).
I suppose so too. One observation is that R5900 BogoMIPS went down from
584 to 195 when switching from the generic kernel variant
.set noreorder
1: bnez a0,1b
addiu a0,a0,-1
.set reorder
that is undefined for R5900 in arch/mips/lib/delay.c, to a special variant
beqz a0,2f
1: addiu a0,a0,-1
bnez a0,1b
2:
for the R5900 where GAS will place a NOP in the branch delay slot. With
loop unrolling BogoMIPS could be inflated again, of course. In general,
unrolling is a bit better for the R5900 than general MIPS. Perhaps GCC
could be informed about that, possibly via profile-guided optimisation.
> > A reasonable fix for GCC is perhaps to update gcc/config/mips/mips.md to not
> > make explicit use of the branch delay slot, as suggested by the patch below?
> > Then GCC will emit
> >
> > loop: addiu $5,$5,1
> > addiu $4,$4,1
> > lb $2,-1($5)
> > sb $2,-1($4)
> > bne $2,$3,loop
> >
> > that GAS will adjust in the ELF object to
> >
> > 4: 24a50001 addiu a1,a1,1
> > 8: 24840001 addiu a0,a0,1
> > c: 80a2ffff lb v0,-1(a1)
> > 10: a082ffff sb v0,-1(a0)
> > 14: 1443fffb bne v0,v1,4
> > 18: 00000000 nop
> >
> > where a NOP is placed in the delay slot to avoid the bug. For longer loops,
> > this relies on GAS making appropriate use of the delay slot. I'm not sure if
> > GAS is good at that, though?
>
> I'm sort-of surprised that GCC has produced `reorder' code here, making
> it rely on GAS for delay slot scheduling. Have you used an unusual set of
> options that prevents GCC from making `noreorder' code, which as I recall
> is the usual default (not for the MIPS16 mode IIRC, as well as some
> obscure corner cases)?
I used the suggested patch, and recompiled the kernel with it, and verified
with the machine code tool that there were no undefined loops. Wouldn't that
be enough?
> > diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md
> > index e17b1d522f0..acd31a8960c 100644
> > --- a/gcc/config/mips/mips.md
> > +++ b/gcc/config/mips/mips.md
> > @@ -1091,6 +1091,7 @@
> >
> > (define_delay (and (eq_attr "type" "branch")
> > (not (match_test "TARGET_MIPS16"))
> > + (not (match_test "TARGET_FIX_R5900"))
> > (eq_attr "branch_likely" "yes"))
> > [(eq_attr "can_delay" "yes")
> > (nil)
> > @@ -1100,6 +1101,7 @@
> > ;; not annul on false.
> > (define_delay (and (eq_attr "type" "branch")
> > (not (match_test "TARGET_MIPS16"))
> > + (not (match_test "TARGET_FIX_R5900"))
> > (ior (match_test "TARGET_CB_NEVER")
> > (and (eq_attr "compact_form" "maybe")
> > (not (match_test "TARGET_CB_ALWAYS")))
> >
>
> I think you need to modify the default `can_delay' attribute definition
> instead (in the same way).
I tried to limit the case to branch delays only, which is a rough
approximation. Call and jump delay slots are acceptable. Are you referring
to this piece?
;; Can the instruction be put into a delay slot?
(define_attr "can_delay" "no,yes"
(if_then_else (and (eq_attr "type" "!branch,call,jump")
(eq_attr "hazard" "none")
(match_test "get_attr_insn_count (insn) == 1"))
(const_string "yes")
(const_string "no")))
> An improved future version might determine the
> exact conditions as noted in your proposed commit description, however I'd
> suggest making this simple change first.
Learning the exact conditions, in a hardware sense, would be interesting
indeed, as some short loops actually do appear to work despite being labeled
as undefined. A fairly deep understanding of the R5900 implementation is
essential for such an undertaking. :)
Fredrik