Hi,

On Tue, 27 Oct 2009, Markus L wrote:
Hi, I recently read the articles about the selective scheduling implementation and found them quite interesting. I especially liked how neatly software pipelining is integrated. The target I am working on is a VLIW DSP, so obviously these things are very important for good code generation. However, when compiling the following C function with -fselective-scheduling2 and -fsel-sched-pipelining, I face a few problems.
Increase the verbosity of the scheduler dumps to obtain more useful information by passing the following flags:

  -fdump-rtl-sched1-details -fdump-rtl-sched2-details -fsched-verbose=6

It may also be useful to compare the scheduler's behaviour on your target with ia64. Building a full-fledged cross-compiler is not necessary for that: just run 'configure --target=ia64-linux && make all-gcc' and invoke gcc/cc1 directly (to produce dumps, change 'sched2' to 'mach' in the flags above, since sel-sched is invoked from the machine-reorg pass on ia64). More comments below.
long dotproduct2(int *a, int *b)
{
  int i;
  long s=0;
  for (i = 0; i < 256; i++)
    s += (long)*a++ * *b++;
  return s;
}

The output I get from sched2 pass is:

...
Scheduling region 0
Scheduling on fences: (uid:32;seqno:6;)
scanning new insn with uid = 80.
deleting insn with uid = 80.
Scheduled 0 bookkeeping copies, 0 insns needed bookkeeping, 0 insns renamed, 0 insns substituted
Scheduling region 1
Scheduling on fences: (uid:72;seqno:1;)
scanning new insn with uid = 81.
deleting insn with uid = 81.
Scheduled 0 bookkeeping copies, 0 insns needed bookkeeping, 0 insns renamed, 0 insns substituted
Scheduling region 2
Scheduling on fences: (uid:65;seqno:1;)
scanning new insn with uid = 82.
deleting insn with uid = 82.
Scheduled 0 bookkeeping copies, 0 insns needed bookkeeping, 0 insns renamed, 0 insns substituted

(note 26 27 65 2 NOTE_INSN_FUNCTION_BEG)

(insn:TI 65 26 30 2 dotprod2.c:2 (set (mem:QI (pre_dec (reg/f:QI 32 sp)) [0 S1 A16])
        (reg/f:QI 32 sp)) 12 {pushqi1} (nil))

(insn 30 65 62 2 dotprod2.c:2 (set (reg/v:HI 16 a0l [orig:62 s ] [62])
        (const_int 0 [0x0])) 6 {*zero_load_hi} (expr_list:REG_EQUAL (const_int 0 [0x0])
        (nil)))

(insn 62 30 66 2 dotprod2.c:2 (set (reg:QI 2 r2 [70])
        (const_int 256 [0x100])) 5 {*constant_load_qi} (expr_list:REG_EQUAL (const_int 256 [0x100])
        (nil)))

(insn:TI 66 62 67 2 dotprod2.c:2 (set (mem:QI (pre_dec (reg/f:QI 32 sp)) [0 S1 A16])
        (reg/f:QI 33 dp)) 12 {pushqi1} (nil))

(insn:TI 67 66 69 2 dotprod2.c:2 (set (reg/f:QI 33 dp)
        (reg/f:QI 32 sp)) 10 {*move_regs_qi} (nil))

(note 69 67 39 2 NOTE_INSN_PROLOGUE_END)

(code_label 39 69 31 3 2 "" [1 uses])

(note 31 39 34 3 [bb 3] NOTE_INSN_BASIC_BLOCK)

(note 34 31 32 3 NOTE_INSN_DELETED)

(insn:TI 32 34 33 3 dotprod2.c:10 (set (reg:QI 19 a1h [67])
        (mem:QI (post_inc:QI (reg/v/f:QI 1 r1 [orig:65 b ] [65])) [2 S1 A16])) 3 {*load_word_qi_with_post_inc} (expr_list:REG_INC (reg/v/f:QI 1 r1 [orig:65 b ] [65])
        (nil)))

(insn 33 32 35 3 dotprod2.c:10 (set (reg:QI 18 a1l [68])
        (mem:QI (post_inc:QI (reg/v/f:QI 0 r0 [orig:64 a ] [64])) [2 S1 A16])) 3 {*load_word_qi_with_post_inc} (expr_list:REG_INC (reg/v/f:QI 0 r0 [orig:64 a ] [64])
        (nil)))

(insn 35 33 61 3 dotprod2.c:10 (set (reg/v:HI 16 a0l [orig:62 s ] [62])
        (plus:HI (mult:HI (sign_extend:HI (reg:QI 19 a1h [67]))
                (sign_extend:HI (reg:QI 18 a1l [68])))
            (reg/v:HI 16 a0l [orig:62 s ] [62]))) 23 {multacc} (expr_list:REG_DEAD (reg:QI 19 a1h [67])
        (expr_list:REG_DEAD (reg:QI 18 a1l [68])
            (nil))))

(jump_insn:TI 61 35 75 3 dotprod2.c:8 (parallel [
            (set (pc)
                (if_then_else (ne (reg:QI 2 r2 [70])
                        (const_int 1 [0x1]))
                    (label_ref:QI 39)
                    (pc)))
            (set (reg:QI 2 r2 [70])
                (plus:QI (reg:QI 2 r2 [70])
                    (const_int -1 [0xffffffff])))
            (use (const_int 255 [0xff]))
            (use (const_int 255 [0xff]))
            (use (const_int 1 [0x1]))
        ]) 43 {doloop_end_internal} (expr_list:REG_BR_PROB (const_int 9899 [0x26ab])
        (nil)))

(note 75 61 70 4 [bb 4] NOTE_INSN_BASIC_BLOCK)

(note 70 75 72 4 NOTE_INSN_EPILOGUE_BEG)
...

The loop body is not correctly scheduled: the TImode flags indicate that the entire loop body will be executed in a single cycle as a VLIW packet, and this will not work since no loop-prologue code has been emitted.
I suspect your machine description says that the dependency between the loads and the multiply-add has zero latency, thus allowing the scheduler to place them into one instruction group. Grep for the various comments about the tick_check_p function. In verbose scheduler dumps, there should be something like

  Expr 35 is not ready yet until cycle 2
  No best expr found!
  Finished a cycle. Current cycle = 2
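To make the latency point concrete, here is a toy model in C of the readiness rule. It is only a sketch and not GCC code: the function names ready_cycle and fits_in_group are made up for illustration, and the model assumes a consumer becomes ready at the maximum over its producers of (issue cycle + dependence latency).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a GCC function): earliest cycle at which a
   consumer may issue, given the issue cycle of each producer and the
   latency of each dependence.  */
int ready_cycle(const int *producer_cycle, const int *latency, size_t n)
{
    int ready = 0;
    for (size_t i = 0; i < n; i++) {
        assert(latency[i] >= 0);
        int t = producer_cycle[i] + latency[i];
        if (t > ready)
            ready = t;
    }
    return ready;
}

/* Hypothetical helper: nonzero if the consumer may be placed into the
   instruction group issued at group_cycle.  */
int fits_in_group(int group_cycle, const int *producer_cycle,
                  const int *latency, size_t n)
{
    return ready_cycle(producer_cycle, latency, n) <= group_cycle;
}
```

With both loads issued in cycle 1 and zero latency, the multiply-add is ready in cycle 1 and may share their group. With a latency of 1 it is not ready until cycle 2, so the group has to be closed first.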
My (probably quite limited) understanding of what should happen is that:

1. The fence is placed at (before) uid 32.
2. Instructions uid 32 and uid 33 are scheduled in this VLIW group.
3. The fence is advanced to uid 35.
4. Instruction uid 35 is scheduled, and instructions uid 32 and 33 are moved up and scheduled in this group also. In the process of moving up uid 32 and 33, bookkeeping copies are created on the loop entry edge.
On the high level, yes. In this particular example, pipelining of the loads would not be possible for the following reasons:

1) speculative motion of loads with pre/post-increment is not implemented (the ia64 backend disables the auto-inc generation pass when sel-sched is enabled);
2) when pipelining loads, the scheduler needs to transform them into control-speculative form (since a loop epilogue is not generated, the load on the very last iteration of the transformed loop may access unallocated memory). In other words, the selective scheduler does not preserve the number of instruction executions (pipelined instructions from the original loop will be executed more times than the number of loop iterations). Speculative loads are not supported by any mainline GCC target except ia64.

I suggest you also take a look at modulo scheduling, which does not need speculation support to pipeline loops with loads. However, it also does not currently support loads with post-increment.
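To illustrate reason 2 at the source level, here is a hand-written C sketch of what a loop pipelined without an epilogue looks like. It is not scheduler output; the names dot_reference and dot_pipelined and the loads counter are made up for illustration. The loads for the first iteration sit on the loop entry edge (the bookkeeping copies), and each trip issues the loads for the next iteration next to the current multiply-add.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop shape: exactly n loads from each array.  */
long dot_reference(const int *a, const int *b, size_t n, size_t *loads)
{
    long s = 0;
    *loads = 0;
    for (size_t i = 0; i < n; i++) {
        int x = a[i];
        int y = b[i];
        *loads += 1;            /* one pair of loads per iteration */
        s += (long)x * y;
    }
    return s;
}

/* Pipelined shape without an epilogue.  The last trip still loads
   a[n] and b[n], so the caller must provide n + 1 readable elements;
   without that guarantee the loads would have to be speculative.  */
long dot_pipelined(const int *a, const int *b, size_t n, size_t *loads)
{
    assert(n > 0);
    long s = 0;
    int x = a[0];               /* copies on the loop entry edge */
    int y = b[0];
    *loads = 1;
    for (size_t i = 0; i < n; i++) {
        int next_x = a[i + 1];  /* reads a[n] on the final trip */
        int next_y = b[i + 1];
        *loads += 1;
        s += (long)x * y;       /* multiply-add for iteration i */
        x = next_x;
        y = next_y;
    }
    return s;
}
```

Both functions return the same sum, but the pipelined one performs n + 1 pairs of loads instead of n. That extra pair past the end of the data is the access that needs the control-speculative form.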
I've tried to debug this without much success and would very much appreciate any comments on what to look for or what I might be doing wrong. The GCC version that I am using is 4.4.1.

BR
/Markus