Ping.
Ayal, could you review this patch and these two patches too.
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html

Happy holidays.

2011/12/7 Roman Zhuykov <zhr...@ispras.ru>:
> Apologies for the messed up previous e-mail.
>
> 2011/10/12 Ayal Zaks <ayal.z...@gmail.com>:
>>>> - the last jump instruction should look like:  pc=(regF!=0)?label:pc, regF 
>>>> is
>>
>> you'd probably want to bump to next instruction if falling through,
>> e.g., pc=(regF!=0)?label:pc+4
>>
>
> It is considered that program counter is increased automatically on
> hardware level.
> Otherwise we should add something like "pc=pc+4" in parallel to each
> instruction in RTL.
>
>>>>  flag register;
>>>> - the last instruction which sets regF should be: regF=COMPARE(regC,X), 
>>>> where X
>>>>  is a constant, or maybe a register, which is not changed inside a loop;
>>>> - only one instruction modifies regC inside a loop (other can use regC, 
>>>> but not
>>>>  write), and it should simply adjust it by a constant: regC=regC+step, 
>>>> where
>>>>  step is a constant.
>>>
>>>> When doloop is succesfully scheduled by SMS, its number of
>>>> iterations of loop kernel should be decreased by the number of stages in a
>>>> schedule minus one, while other iterations expand to prologue and epilogue.
>>>> In new supported loops such approach can't be used, because some
>>>> instructions can use count register (regC).  Instead of this,
>>>> the final register value X in compare instruction regF=COMPARE(regC,X)
>>>> is changed to another value Y respective to the stage this instruction
>>>> is scheduled (Y = X - stage * step).
>>
>> making sure this does not underflow; i.e., that the number of
>> iterations is no less than stage (you've addressed this towards the
>> end below).
>>
>
> Yes, this situation is processed correctly.
>
>>>
>>> The main difference from doloop case is that regC can be used by some
>>> instructions in loop body.
>>> That's why we are unable to simply adjust regC initial value, but have
>>> to keep it's value correct on each particular iteration.
>>> So, we change comparison instruction accordingly.
>>>
>>> An example:
>>> int a[100];
>>> int main()
>>> {
>>>  int i;
>>>  for (i = 85; i > 12; i -= 5)
>>>      a[i] = i * i;
>>>  return a[15]-225;
>>> }
>>> ARM assembler with "-O2 -fno-auto-inc-dec":
>>>        ldr     r0, .L5
>>>        mov     r3, #85
>>>        mov     r2, r0
>>> .L2:
>>>        mul     r1, r3, r3
>>>        sub     r3, r3, #5
>>>        cmp     r3, #10
>>>        str     r1, [r2, #340]
>>>        sub     r2, r2, #20
>>>        bne     .L2
>>>        ldr     r0, [r0, #60]
>>>        sub     r0, r0, #225
>>>        bx      lr
>>> .L5:
>>>        .word   a
>>>
>>> Loop body is executed 15 times.
>>> When compiling with SMS, it finds a schedule with ii=7, stage_count=3
>>> and following times:
>>> Stage  Time       Insn
>>> 0          5      mul     r1, r3, r3
>>> 1         10     sub     r3, r3, #5
>>> 1         11     cmp     r3, #10
>>> 1         11     str     r1, [r2, #340]
>>> 1         13     bne     .L2
>>> 2         16     sub     r2, r2, #20
>>>
>>
>> branch is not scheduled last?
>>
>
> Yes, branch schedule time is smaller then sub's one.
> This mean that "sub r2, r2, $20" instruction from original iteration
> number K will be executed after
> "bne .L2" from original iteration number K.
> But certainly bne remains to be the last instuction in new loop body.
> Below you can see how it looks after SMS.
>
>>> To make new schedule correct the loop body
>>> should be executed 14 times and we change compare instruction:
>>
>> the loop itself should execute 13 times.
>
> with i =
> 85, 80, 75, 70, 65
> 60, 55, 50, 45, 40
> 35, 30, 25, 20, 15
> this gives total 15 iterations (15 stores to memory).
> And new loop body will be executed 13 times (one store goes to
> epilogue and one - to prologue).
>
>>> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step.
>>> In our example regC is r3, X is 10, step = -5, compare instruction
>>> is scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15.
>>>
>>
>> right. In general, if the compare is on stage s (starting from 0), it
>> will be executed s times in the epilog, so it should exit the loop
>> upon reaching Y = X - s * step.
>>
>>> So, after SMS it looks like:
>>>        ldr     r0, .L5
>>>        mov     r3, #85
>>>        mov     r2, r0
>>> ;;prologue
>>>        mul     r1, r3, r3      ;;from stage 0 first iteration
>>>        sub     r3, r3, #5      ;;3 insns from stage 1 first iteration
>>>        cmp     r3, #10
>>>        str     r1, [r2, #340]
>>>        mul     r1, r3, r3      ;;from stage 0 second iteration
>>> ;;body
>>> .L2:
>>>        sub     r3, r3, #5
>>>        sub     r2, r2, #20
>>>        cmp     r3, #15         ;; new value to compare with is Y=15
>>>        str     r1, [r2, #340]
>>>        mul     r1, r3, r3
>>>        bne     .L2
>>> ;;epilogue
>>>        sub     r2, r2, #20     ;;from stage 2 pre-last iteration
>>>        sub     r3, r3, #5      ;;3 insns from stage 1 last iteration
>>>        cmp     r3, #10
>>>        str     r1, [r2, #340]
>>>        sub     r2, r2, #20     ;;from stage 2 last iteration
>>>
>>>        ldr     r0, [r0, #60]
>>>        sub     r0, r0, #225
>>>        bx      lr
>>> .L5:
>>>        .word   a
>>>
>
> Here in comments I mention why insn was copied to prolog and epilog.
> Only branch is not copied at all.
>
>>>> Testing of this appoach reveals two bugs, which do not appear while SMS was
>>>> used only for doloop loops.  Both these bugs happen due to the nature of 
>>>> the
>>>> flag register.  On x86_64 it is clobbered by most of arithmetic 
>>>> instructions.
>> This should ideally be solved by a dedicated (separate) patch.
>> ...
>> This too should be solved by a dedicated (separate) patch, for easier 
>> digestion.
>
> As Ayal asks, I'll continue discussion of these two bugs in two
> separate e-mails, answering on this letter.
>
>>>>
>>>> One more thing to point out is number of loop iterations. When number of
>>>> iterations of a loop is not known at compile time, SMS has to create two 
>>>> loop
>>>> versions (original and scheduled), and execute scheduled one only when real
>>>> number of iterations is bigger than number of stages.  In doloop case the
>>>> number of iterations simply equals to the count register value before the 
>>>> loop.
>>>> So SMS finds its constant initialization or makes two loop versions.  In 
>>>> new
>>>> supported loops number of iterations value is more complex.  It even can't 
>>>> be
>>>> calculated as (final_reg_value-start_reg_value)/step because of examples 
>>>> like
>>>> this:
>>>>
>>>> for (unsigned int x = 0x0; x != 0x6F80919A; x += 0xEDCBA987) ...;
>>>>
>>>> This loop has 22 iterations.  So, i decided to use get_simple_loop_desc
>>>> function which gives a structure with loop characteristics, some of them 
>>>> helps
>>>> to find iteration number:
>>>>
>>>> rtx niter_expr - The number of iterations of the loop;
>>>> bool const_iter - True if the loop iterates the constant number of times;
>>>> unsigned HOST_WIDEST_INT niter - Number of iterations if constant;
>>>>
>>>> But we can use these expressions only after looking through some other 
>>>> fields
>>>> of returned structure:
>>>>
>>>> bool simple_p - True if we are able to say anything about number of 
>>>> iterations
>>>> of the loop;
>>>> rtx assumptions - Assumptions under that the rest of the information is 
>>>> valid;
>>>> rtx noloop_assumptions - Assumptions under which the loop ends before 
>>>> reaching
>>>> the latch;
>>>> rtx infinite - Condition under which the loop is infinite.
>>>>
>>>> I decide to allow SMS scheduling only when simple_p is true and other three
>>>> fields are NULL_RTX, or when simple_p is true and
>>>> flag_unsafe_loop_optimizations is set.  One more exception is infinite
>>>> condition, and the next separate patch is an attempt to process it.
>>>>
>>
>> ok, still need to go over this rather lengthy and orthogonal (although
>> it exposes the bugs above) piece.
>>
>> Ayal.
>>
>>
>
> New version is attached, it suits current trunk.
> Without fixing both bugs mentioned above, this patch brokes bootstrap on 
> x86-64.
>
> Together with DDG fixes the patch was succesfully regtested on ARM,
> and "regstrapped" on x86-64 and IA64.
>
> --
> Roman Zhuykov
> zhr...@ispras.ru

Reply via email to