Delay slot filling - what still matters, and what doesn't matter so much anymore?

Steven Bosscher Wed, 17 Apr 2013 14:53:17 -0700

Hello delay-slot target maintainers :-)

As you know, I'm playing with a new for-now-toy delay slot filling
pass that preserves the CFG, and uses DF and sched-deps instead of
resource.c. It's now beginning to take form enough that I run into the
to-be-expected unexpected problems and questions. The biggest problem
is that I have never been this far down into machine details since the
DFA scheduler conversions, and have never worked with targets that
have delay slots. I have no idea what really matters, and I hope you
can help me with some of those questions.



First of all: What is still important to handle?

It's clear that the expectations in reorg.c are "anything goes" but
modern RISCs (everything since the PA-8000, say) probably have some
limitations on what is helpful to have, or not have, in a delay slot.
According to the comments in pa.h about MASK_JUMP_IN_DELAY, having
jumps in delay slots of other jumps is one such thing: They don't
bring benefit to the PA-8000 and they don't work with DWARF2 CFI. As
far as I know, SPARC and MIPS don't allow jumps in delay slots, SH
looks like it doesn't allow it either, and CRIS can do it for short
branches but doesn't do so because the trade-off between benefit and
machine description complexity comes out negative. On the scheduler
implementation side: Branches as delayed insns in delay slots of other
branches are impossible to express in the CFG (at least in GCC, but I
think in general it can't be done cleanly). Therefore I want to drop
support for branches in delay slots. What do you think about this?

What about multiple delay slots? It looks like reorg.c has code to
handle insns with multiple delay slots, but there currently are no GCC
targets in the FSF tree that have insns with multiple delay slots and
that use define_delay. The C6X has many more delay slots than just 1
(it can have up to 5 delay slots IIRC) but it is much more flexible
than traditional RISCs when it comes to putting insns in delay slots
(it uses predication so it can annul delayed insns on various
conditions) and it uses a very clever (and effective??) delay slot
filling mechanism via the normal scheduler, using back-tracking and
"jump shadows" (see UNSPEC_JUMP_SHADOW in the c6x back end). But C6X
doesn't use reorg.c delay slot scheduling. I'm not aware of any
non-VLIW, non-DSP targets with more than one delay slot per insn, and
new VLIW/DSP ports with delay slots probably should look at c6x rather
than using define_delay. Supporting only a single delay slot per
delay_insn would make my scheduler a bit less complex. Would that be
enough for everyone, or is it necessary to continue to support
multiple delay slots per insn?


Another thing I completely fail to grasp is how the pipeline
scheduler and delay slots interact. Doesn't dbr_schedule destroy all
the good work schedule_insns has tried to do? If so, how much does
that hurt on modern RISCs?


Related question: What, if anything, currently prevents dbr_schedule
from causing pipeline stalls by stuffing a long-latency insn in a
delay slot? I'm currently using a cost function using:

cost = insn_default_latency (trial_insn) - insn_default_latency (delay_insn);

saying that a trial_insn with greater latency than delay_insn, and
from the same basic block as delay_insn, should not be put in the
delay slot. But that's preventing my scheduler from filling slots that
reorg.c does fill. For example a case like this on sparc, where cost=1
is greater than the cost threshold I'm using (cost==0 i.e. no cost):

(gdb) p debug_rtx(delay_insn)
(jump_insn 18 0 0 2 (set (pc)
        (if_then_else (gt (reg:CCX 100 %icc)
                (const_int 0 [0]))
            (label_ref:DI 77)
            (pc))) t.c:18 48 {*normal_branch}
     (expr_list:REG_DEAD (reg:CCX 100 %icc)
        (expr_list:REG_BR_PROB (const_int 2900 [0xb54])
            (nil)))
 -> 77)
$5 = void
(gdb) p insn_default_latency(delay_insn)
$6 = 1
(gdb) p debug_rtx(trial_insn)
(insn/s:TI 16 13 17 2 (set (reg/v:DI 26 %i2 [orig:112 d ] [112])
        (mem/c:DI (plus:DI (reg/f:DI 1 %g1 [122])
                (const_int 24 [0x18])) [2 x+24 S8 A64])) t.c:14 72 {*movdi_insn_sp64}
     (expr_list:REG_DEAD (reg/f:DI 1 %g1 [122])
        (nil)))
$7 = void
(gdb) p insn_default_latency(trial_insn)
$8 = 2
(gdb)

What do you think will be a good strategy to deal with this (short of
integrating delay slot filling in the scheduler proper)? Should I try
to find cost==0 delay slot candidates, and only fill slots with cost>0
candidates if nothing cheap is available? Prefer a nop over cost>0
candidates? Ignore insn_default_latency?
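
To make the options above concrete, here is a minimal, stand-alone C
sketch of one possible selection policy. It is not GCC code: struct
cand, fill_cost, pick_candidate and max_cost are hypothetical names
invented for illustration, and the latency field merely stands in for
whatever insn_default_latency would return. The sketch takes the first
cost<=0 candidate if there is one, otherwise the cheapest candidate
whose cost does not exceed max_cost, and otherwise reports that the
slot should get a nop.

```c
#include <assert.h>

/* Hypothetical stand-in for an insn; only the latency matters here.  */
struct cand
{
  int uid;      /* insn number, for identification only */
  int latency;  /* stand-in for insn_default_latency (insn) */
};

/* The cost function from the mail: trial latency minus the latency
   of the insn that owns the delay slot.  */
static int
fill_cost (const struct cand *trial, const struct cand *delay)
{
  return trial->latency - delay->latency;
}

/* Return the index of the candidate to put in the delay slot of DELAY,
   or -1 if the slot should be filled with a nop.  A candidate with
   cost <= 0 is taken immediately.  Otherwise the cheapest candidate
   with cost <= MAX_COST wins.  MAX_COST == 0 therefore means
   "prefer a nop over any cost>0 candidate".  */
static int
pick_candidate (const struct cand *trials, int n,
                const struct cand *delay, int max_cost)
{
  int best = -1;
  int best_cost = 0;

  assert (n >= 0);
  for (int i = 0; i < n; i++)
    {
      int cost = fill_cost (&trials[i], delay);

      if (cost <= 0)
        return i;  /* Free candidate: no need to look further.  */

      if (cost <= max_cost && (best < 0 || cost < best_cost))
        {
          best = i;
          best_cost = cost;
        }
    }
  return best;
}
```

With the sparc example above (branch latency 1, load latency 2), the
load has cost 1: a max_cost of 0 leaves the slot to a nop, while a
max_cost of 1 lets the load fill it. Setting max_cost very high
approximates "ignore insn_default_latency" when no free candidate
exists.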


Another thing I noticed about targets with delay slots that can be
nullified is that at least some of the ifcvt.c transformations could
be applied to fill more delay slots (obviously if_case_1 and
if_case_2). In reorg.c, optimize_skip does some kind of if-conversion.
Has anyone looked at whether optimize_skip still does something, and
derived a test case for that?


Thanks for any comments/suggestions/insights/...

Ciao!
Steven
