Hi,

On Tue, 27 Oct 2009, Markus L wrote:

Hi,

I recently read the articles about the selective scheduling
implementation and found them quite interesting. I especially liked
how neatly software pipelining is integrated. The target I am
working on is a VLIW DSP so obviously these things are very important
for good code generation.

However, when compiling the following C function with
-fselective-scheduling2 and -fsel-sched-pipelining I face a few
problems.

Increase verbosity of scheduler dumps to obtain more useful information by
passing the following flags:
 -fdump-rtl-sched1-details -fdump-rtl-sched2-details -fsched-verbose=6

It may also be useful to compare the scheduler behaviour on your target with
ia64.  Building a full-fledged cross-compiler is not necessary for that:
just run 'configure --target=ia64-linux && make all-gcc' and invoke gcc/cc1
(to produce dumps, change 'sched2' to 'mach' in the flags above, since
sel-sched is invoked from the machine-reorg pass on ia64).

More comments below.

long dotproduct2(int *a, int *b)
{
   int i;
   long s=0;

   for (i = 0; i < 256; i++)
      s += (long)*a++ * *b++;
   return s;
}

The output I get from sched2 pass is:
...
Scheduling region 0

Scheduling on fences: (uid:32;seqno:6;)
scanning new insn with uid = 80.
deleting insn with uid = 80.
Scheduled 0 bookkeeping copies, 0 insns needed bookkeeping, 0 insns
renamed, 0 insns substituted
Scheduling region 1

Scheduling on fences: (uid:72;seqno:1;)
scanning new insn with uid = 81.
deleting insn with uid = 81.
Scheduled 0 bookkeeping copies, 0 insns needed bookkeeping, 0 insns
renamed, 0 insns substituted
Scheduling region 2

Scheduling on fences: (uid:65;seqno:1;)
scanning new insn with uid = 82.
deleting insn with uid = 82.
Scheduled 0 bookkeeping copies, 0 insns needed bookkeeping, 0 insns
renamed, 0 insns substituted

(note 26 27 65 2 NOTE_INSN_FUNCTION_BEG)

(insn:TI 65 26 30 2 dotprod2.c:2 (set (mem:QI (pre_dec (reg/f:QI 32
sp)) [0 S1 A16])
       (reg/f:QI 32 sp)) 12 {pushqi1} (nil))

(insn 30 65 62 2 dotprod2.c:2 (set (reg/v:HI 16 a0l [orig:62 s ] [62])
       (const_int 0 [0x0])) 6 {*zero_load_hi} (expr_list:REG_EQUAL
(const_int 0 [0x0])
       (nil)))

(insn 62 30 66 2 dotprod2.c:2 (set (reg:QI 2 r2 [70])
       (const_int 256 [0x100])) 5 {*constant_load_qi}
(expr_list:REG_EQUAL (const_int 256 [0x100])
       (nil)))

(insn:TI 66 62 67 2 dotprod2.c:2 (set (mem:QI (pre_dec (reg/f:QI 32
sp)) [0 S1 A16])
       (reg/f:QI 33 dp)) 12 {pushqi1} (nil))

(insn:TI 67 66 69 2 dotprod2.c:2 (set (reg/f:QI 33 dp)
       (reg/f:QI 32 sp)) 10 {*move_regs_qi} (nil))

(note 69 67 39 2 NOTE_INSN_PROLOGUE_END)

(code_label 39 69 31 3 2 "" [1 uses])

(note 31 39 34 3 [bb 3] NOTE_INSN_BASIC_BLOCK)

(note 34 31 32 3 NOTE_INSN_DELETED)

(insn:TI 32 34 33 3 dotprod2.c:10 (set (reg:QI 19 a1h [67])
       (mem:QI (post_inc:QI (reg/v/f:QI 1 r1 [orig:65 b ] [65])) [2
S1 A16])) 3 {*load_word_qi_with_post_inc} (expr_list:REG_INC
(reg/v/f:QI 1 r1 [orig:65 b ] [65])
       (nil)))

(insn 33 32 35 3 dotprod2.c:10 (set (reg:QI 18 a1l [68])
       (mem:QI (post_inc:QI (reg/v/f:QI 0 r0 [orig:64 a ] [64])) [2
S1 A16])) 3 {*load_word_qi_with_post_inc} (expr_list:REG_INC
(reg/v/f:QI 0 r0 [orig:64 a ] [64])
       (nil)))

(insn 35 33 61 3 dotprod2.c:10 (set (reg/v:HI 16 a0l [orig:62 s ] [62])
       (plus:HI (mult:HI (sign_extend:HI (reg:QI 19 a1h [67]))
               (sign_extend:HI (reg:QI 18 a1l [68])))
           (reg/v:HI 16 a0l [orig:62 s ] [62]))) 23 {multacc}
(expr_list:REG_DEAD (reg:QI 19 a1h [67])
       (expr_list:REG_DEAD (reg:QI 18 a1l [68])
           (nil))))

(jump_insn:TI 61 35 75 3 dotprod2.c:8 (parallel [
           (set (pc)
               (if_then_else (ne (reg:QI 2 r2 [70])
                       (const_int 1 [0x1]))
                   (label_ref:QI 39)
                   (pc)))
           (set (reg:QI 2 r2 [70])
               (plus:QI (reg:QI 2 r2 [70])
                   (const_int -1 [0xffffffff])))
           (use (const_int 255 [0xff]))
           (use (const_int 255 [0xff]))
           (use (const_int 1 [0x1]))
       ]) 43 {doloop_end_internal} (expr_list:REG_BR_PROB (const_int
9899 [0x26ab])
       (nil)))

(note 75 61 70 4 [bb 4] NOTE_INSN_BASIC_BLOCK)

(note 70 75 72 4 NOTE_INSN_EPILOGUE_BEG)

...

The loop body is not correctly scheduled: the TImode flags indicate
that the entire loop body will be executed in a single cycle as a VLIW
packet, and this will not work since no loop-prologue code has been
emitted.

I suspect your machine description says that the dependency between the loads
and the multiply-add has zero latency, thus allowing the scheduler to place
them into one instruction group.  Grep for the comments about the tick_check_p
function.  In verbose scheduler dumps, there should be something like

Expr 35 is not ready yet until cycle 2
No best expr found!
Finished a cycle.  Current cycle = 2
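
To illustrate the latency point, here is a toy model in C.  It is only a
sketch of the general rule "consumer is ready at producer cycle plus
latency" and not the code GCC uses; the function name ready_cycle and its
parameters are made up for this example.

```c
#include <assert.h>

/* Toy model (hypothetical, not GCC code): return the earliest cycle in
   which a consumer insn may issue, given the issue cycle of each of its
   n producers and the latency of each dependence. */
int ready_cycle(const int *producer_cycle, const int *latency, int n)
{
    int ready = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int c = producer_cycle[i] + latency[i];
        if (c > ready)
            ready = c;
    }
    return ready;
}
```

With both loads issued in cycle 0 and a latency of 0 on each dependence,
the multiply-add is ready in cycle 0 as well, i.e. in the same group as
the loads.  With a latency of 1 it cannot issue before cycle 1.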

My (probably quite limited) understanding of what should happen is that:

1. the fence is placed at (before) uid 32.
2. Instructions uid 32 and uid 33 are scheduled in this vliw group
3. The fence is advanced to uid 35.
4. Instruction uid 35 is scheduled and instructions uid 32 and 33 are
moved up and scheduled in this group also. In the process of moving up
uid 32 and 33 bookkeeping copies are created on the loop entry edge.

At a high level, yes.  In this particular example, pipelining of the loads
would not be possible for the following reasons:
1) speculative motion of loads with pre/post-increment is not implemented
(the ia64 backend disables the auto-inc generation pass when sel-sched is
enabled);

2) when pipelining loads, the scheduler needs to transform them into
control-speculative form (since no loop epilogue is generated, the load on
the very last iteration of the transformed loop may access unallocated
memory).  In other words, the selective scheduler does not preserve the
number of instruction executions (pipelined instructions from the original
loop will be executed more times than the number of loop iterations).
Speculative loads are not supported by any mainline GCC target except ia64.
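
The extra executions can be shown with a hand-written C sketch.  This is
only an illustration of the loop shape, not what the scheduler emits; the
helper names (load, dot_plain, dot_pipelined, compare_loops) are made up,
and the arrays get one pad element so that the demo itself stays in bounds.

```c
#include <assert.h>

#define N 256

static unsigned long load_count;   /* number of simulated memory loads */

static int load(const int *p)
{
    load_count++;
    return *p;
}

/* Original loop shape: exactly 2*N loads. */
static long dot_plain(const int *a, const int *b)
{
    long s = 0;
    for (int i = 0; i < N; i++)
        s += (long)load(a + i) * load(b + i);
    return s;
}

/* Hand-pipelined shape without an epilogue: the loads for iteration i+1
   are issued together with the multiply-accumulate of iteration i, so
   the last trip reads a[N] and b[N], which the original never touches. */
static long dot_pipelined(const int *a, const int *b)
{
    long s = 0;
    int x = load(a), y = load(b);   /* copies on the loop entry edge */
    for (int i = 0; i < N; i++) {
        int nx = load(a + i + 1);   /* would have to be speculative */
        int ny = load(b + i + 1);
        s += (long)x * y;
        x = nx;
        y = ny;
    }
    return s;
}

/* Run both shapes on padded arrays; return the number of extra loads
   performed by the pipelined shape. */
int compare_loops(void)
{
    static int a[N + 1], b[N + 1];  /* pad element keeps the demo in bounds */
    for (int i = 0; i <= N; i++) {
        a[i] = i % 7;
        b[i] = i % 5;
    }

    load_count = 0;
    long plain = dot_plain(a, b);
    unsigned long plain_loads = load_count;

    load_count = 0;
    long piped = dot_pipelined(a, b);
    unsigned long piped_loads = load_count;

    assert(plain == piped);
    return (int)(piped_loads - plain_loads);
}
```

Both shapes compute the same sum, but the pipelined one performs two more
loads (one per array), and those loads fall past the last element used by
the original loop.  Without the pad element they would read memory that
may not be allocated, which is why control speculation is needed.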

I suggest you also take a look at modulo scheduling, which does not need
speculation support to pipeline loops with loads.  However, it also does not
currently support loads with post-increment.

I've tried to debug this without much success and would very much
appreciate any comments on what to look for or what I might be doing
wrong.

The GCC version that I am using is 4.4.1.

BR
/Markus
