Ping. Ayal, could you review this patch and these two patches too. http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html
Happy holidays. 2011/12/7 Roman Zhuykov <zhr...@ispras.ru>: > Apologies for the messed up previous e-mail. > > 2011/10/12 Ayal Zaks <ayal.z...@gmail.com>: >>>> - the last jump instruction should look like: pc=(regF!=0)?label:pc, regF >>>> is >> >> you'd probably want to bump to next instruction if falling through, >> e.g., pc=(regF!=0)?label:pc+4 >> > > It is considered that program counter is increased automatically on > hardware level. > Otherwise we should add something like "pc=pc+4" in parallel to each > instruction in RTL. > >>>> flag register; >>>> - the last instruction which sets regF should be: regF=COMPARE(regC,X), >>>> where X >>>> is a constant, or maybe a register, which is not changed inside a loop; >>>> - only one instruction modifies regC inside a loop (other can use regC, >>>> but not >>>> write), and it should simply adjust it by a constant: regC=regC+step, >>>> where >>>> step is a constant. >>> >>>> When doloop is succesfully scheduled by SMS, its number of >>>> iterations of loop kernel should be decreased by the number of stages in a >>>> schedule minus one, while other iterations expand to prologue and epilogue. >>>> In new supported loops such approach can't be used, because some >>>> instructions can use count register (regC). Instead of this, >>>> the final register value X in compare instruction regF=COMPARE(regC,X) >>>> is changed to another value Y respective to the stage this instruction >>>> is scheduled (Y = X - stage * step). >> >> making sure this does not underflow; i.e., that the number of >> iterations is no less than stage (you've addressed this towards the >> end below). >> > > Yes, this situation is processed correctly. > >>> >>> The main difference from doloop case is that regC can be used by some >>> instructions in loop body. >>> That's why we are unable to simply adjust regC initial value, but have >>> to keep it's value correct on each particular iteration. >>> So, we change comparison instruction accordingly. >>> >>> An example: >>> int a[100]; >>> int main() >>> { >>> int i; >>> for (i = 85; i > 12; i -= 5) >>> a[i] = i * i; >>> return a[15]-225; >>> } >>> ARM assembler with "-O2 -fno-auto-inc-dec": >>> ldr r0, .L5 >>> mov r3, #85 >>> mov r2, r0 >>> .L2: >>> mul r1, r3, r3 >>> sub r3, r3, #5 >>> cmp r3, #10 >>> str r1, [r2, #340] >>> sub r2, r2, #20 >>> bne .L2 >>> ldr r0, [r0, #60] >>> sub r0, r0, #225 >>> bx lr >>> .L5: >>> .word a >>> >>> Loop body is executed 15 times. >>> When compiling with SMS, it finds a schedule with ii=7, stage_count=3 >>> and following times: >>> Stage Time Insn >>> 0 5 mul r1, r3, r3 >>> 1 10 sub r3, r3, #5 >>> 1 11 cmp r3, #10 >>> 1 11 str r1, [r2, #340] >>> 1 13 bne .L2 >>> 2 16 sub r2, r2, #20 >>> >> >> branch is not scheduled last? >> > > Yes, branch schedule time is smaller then sub's one. > This mean that "sub r2, r2, $20" instruction from original iteration > number K will be executed after > "bne .L2" from original iteration number K. > But certainly bne remains to be the last instuction in new loop body. > Below you can see how it looks after SMS. > >>> To make new schedule correct the loop body >>> should be executed 14 times and we change compare instruction: >> >> the loop itself should execute 13 times. > > with i = > 85, 80, 75, 70, 65 > 60, 55, 50, 45, 40 > 35, 30, 25, 20, 15 > this gives total 15 iterations (15 stores to memory). > And new loop body will be executed 13 times (one store goes to > epilogue and one - to prologue). > >>> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step. >>> In our example regC is r3, X is 10, step = -5, compare instruction >>> is scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15. >>> >> >> right. In general, if the compare is on stage s (starting from 0), it >> will be executed s times in the epilog, so it should exit the loop >> upon reaching Y = X - s * step. >> >>> So, after SMS it looks like: >>> ldr r0, .L5 >>> mov r3, #85 >>> mov r2, r0 >>> ;;prologue >>> mul r1, r3, r3 ;;from stage 0 first iteration >>> sub r3, r3, #5 ;;3 insns from stage 1 first iteration >>> cmp r3, #10 >>> str r1, [r2, #340] >>> mul r1, r3, r3 ;;from stage 0 second iteration >>> ;;body >>> .L2: >>> sub r3, r3, #5 >>> sub r2, r2, #20 >>> cmp r3, #15 ;; new value to compare with is Y=15 >>> str r1, [r2, #340] >>> mul r1, r3, r3 >>> bne .L2 >>> ;;epilogue >>> sub r2, r2, #20 ;;from stage 2 pre-last iteration >>> sub r3, r3, #5 ;;3 insns from stage 1 last iteration >>> cmp r3, #10 >>> str r1, [r2, #340] >>> sub r2, r2, #20 ;;from stage 2 last iteration >>> >>> ldr r0, [r0, #60] >>> sub r0, r0, #225 >>> bx lr >>> .L5: >>> .word a >>> > > Here in comments I mention why insn was copied to prolog and epilog. > Only branch is not copied at all. > >>>> Testing of this appoach reveals two bugs, which do not appear while SMS was >>>> used only for doloop loops. Both these bugs happen due to the nature of >>>> the >>>> flag register. On x86_64 it is clobbered by most of arithmetic >>>> instructions. >> This should ideally be solved by a dedicated (separate) patch. >> ... >> This too should be solved by a dedicated (separate) patch, for easier >> digestion. > > As Ayal asks, I'll continue discussion of these two bugs in two > separate e-mails, answering on this letter. > >>>> >>>> One more thing to point out is number of loop iterations. When number of >>>> iterations of a loop is not known at compile time, SMS has to create two >>>> loop >>>> versions (original and scheduled), and execute scheduled one only when real >>>> number of iterations is bigger than number of stages. In doloop case the >>>> number of iterations simply equals to the count register value before the >>>> loop. >>>> So SMS finds its constant initialization or makes two loop versions. In >>>> new >>>> supported loops number of iterations value is more complex. It even can't >>>> be >>>> calculated as (final_reg_value-start_reg_value)/step because of examples >>>> like >>>> this: >>>> >>>> for (unsigned int x = 0x0; x != 0x6F80919A; x += 0xEDCBA987) ...; >>>> >>>> This loop has 22 iterations. So, i decided to use get_simple_loop_desc >>>> function which gives a structure with loop characteristics, some of them >>>> helps >>>> to find iteration number: >>>> >>>> rtx niter_expr - The number of iterations of the loop; >>>> bool const_iter - True if the loop iterates the constant number of times; >>>> unsigned HOST_WIDEST_INT niter - Number of iterations if constant; >>>> >>>> But we can use these expressions only after looking through some other >>>> fields >>>> of returned structure: >>>> >>>> bool simple_p - True if we are able to say anything about number of >>>> iterations >>>> of the loop; >>>> rtx assumptions - Assumptions under that the rest of the information is >>>> valid; >>>> rtx noloop_assumptions - Assumptions under which the loop ends before >>>> reaching >>>> the latch; >>>> rtx infinite - Condition under which the loop is infinite. >>>> >>>> I decide to allow SMS scheduling only when simple_p is true and other three >>>> fields are NULL_RTX, or when simple_p is true and >>>> flag_unsafe_loop_optimizations is set. One more exception is infinite >>>> condition, and the next separate patch is an attempt to process it. >>>> >> >> ok, still need to go over this rather lengthy and orthogonal (although >> it exposes the bugs above) piece. >> >> Ayal. >> >> > > New version is attached, it suits current trunk. > Without fixing both bugs mentioned above, this patch brokes bootstrap on > x86-64. > > Together with DDG fixes the patch was succesfully regtested on ARM, > and "regstrapped" on x86-64 and IA64. > > -- > Roman Zhuykov > zhr...@ispras.ru