Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts

Richard Biener Fri, 26 Apr 2019 00:16:20 -0700

On Fri, 26 Apr 2019, Kewen.Lin wrote:

> Hi Segher,
> 
> Thanks a lot for your comments!
> 
> on 2019/4/25 下午8:16, Segher Boessenkool wrote:
>  
> > Does it create worse code now?  What we have before your patch isn't
> > so super either (it has an sldi in the loop, it has two mtctr too).
> > Maybe you can show the generated code?
> 
> It's a good question! From the generated codes for the core loop, the 
> code after my patch doesn't have bdnz to leverage hardware CTR, it has
> extra cmpld and branch instead, looks worse.  But I wrote a tiny case
> to invoke the foo and evaluated the running time, they are equal.
> 
> * Measured time:
>   After:
>     real 199.47
>     user 198.35
>     sys 1.11
>   Before:
>     real 199.19
>     user 198.56
>     sys 0.62
> 
> * Command I used to compile:
> ${local_gcc} ${case_src}/20050830-1.c  -fno-diagnostics-show-caret 
> -fno-diagnostics-show-line-numbers -fdiagnostics-color=never  -O2 
> -ffat-lto-objects -fno-ident -c
> 
> * main file (main.c):
> extern void foo();
> #define LOOPNUM 10000
> #define N 1000000*256
> int a[N];
> int main() {
>   for(int i = 0; i < LOOPNUM; i++) {
>     foo(N);
>   }
> }
> 
> 
> * Generated code sequence:
> 
> Before my patch:
> 
>         cmpwi 0,3,511
>         blelr 0
>         addi 10,3,-256
>         addi 9,3,-512
>         addis 8,2,.LC0@toc@ha
>         ld 8,.LC0@toc@l(8)
>         cmpwi 0,10,256
>         rldicl 9,9,56,40
>         sldi 3,3,2
>         addi 9,9,1
>         mtctr 9
>         add 8,3,8
>         li 10,42
>         blt 0,.L7     # jump to L7 it's less than 256
> .L3:                  # core loop
>         stw 10,0(8)   
>         addi 8,8,-1024
>         bdnz .L3
>         blr
> .L7:
>         li 9,1        # special case, iteration only 1
>         mtctr 9
>         b .L3
> 
> After my patch:
> 
>         cmpwi 0,3,511
>         blelr 0
>         addis 7,2,.LC0@toc@ha
>         ld 7,.LC0@toc@l(7)
>         addi 10,3,-512
>         sldi 9,3,2
>         rlwinm 10,10,0,0,23
>         li 8,42
>         subf 10,10,3
>         sldi 10,10,2
>         addi 6,7,-1024
>         add 9,9,7
>         add 10,10,6
>         .p2align 4,,15
> .L3:                      # core loop
>         stw 8,0(9)
>         addi 9,9,-1024
>         cmpld 0,9,10      # cmp 
>         beqlr 0           # if eq, return
>         stw 8,0(9)     
>         addi 9,9,-1024   
>         cmpld 0,9,10      # cmp again 
>         bne 0,.L3         # if ne, jump to L3.
>         blr
> 
> ------------------------
> 
> I practiced whether we can adjust the decision made in ivopts.
> For one compare iv use with zero cost, if one iv cand whose base
> is from memory has costly elim_cost before adjust_setup_cost,
> it's possible to make later analysis unable to find it's finite,
> then we try to avoid it.
> 
> The diff looks like:
> 
> --- a/gcc/tree-ssa-loop-ivopts.c
> +++ b/gcc/tree-ssa-loop-ivopts.c
> @@ -5126,6 +5126,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data,
>    tree *control_var, *bound_cst;
>    enum tree_code comp = ERROR_MARK;
>    struct iv_use *use = group->vuses[0];
> +  bool dont_elim_p = false;
> 
>    /* Extract condition operands.  */
>    rewrite_type = extract_cond_operands (data, use->stmt, &control_var,
> @@ -5152,6 +5153,16 @@ determine_group_iv_cost_cond (struct ivopts_data *data,
>           inv_expr_elim = get_loop_invariant_expr (data, bound);
>           bitmap_clear (inv_vars_elim);
>         }
> +
> +      /* zero cost use makes it easier to select memory based iv cand
> +         for replacement of non memory based iv and its use.   But if
> +         the setup sequence are too costly, loop iv analysis can NOT
> +         easily figure out it's finite, it's possible to stop the
> +         low-overhead loop transformation and get unexpected code.  */
> +      if (use->zero_cost_p && cand->iv->base_object && !use->iv->base_object
> +         && elim_cost.cost >= 30)
> +       dont_elim_p = true;
> +
>        /* The bound is a loop invariant, so it will be only computed
>          once.  */
>        elim_cost.cost = adjust_setup_cost (data, elim_cost.cost);
> @@ -5184,7 +5195,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data,
>    express_cost += bound_cost;
> 
>    /* Choose the better approach, preferring the eliminated IV. */
> -  if (elim_cost <= express_cost)
> +  if (elim_cost <= express_cost && !dont_elim_p)
>      {
> 
> 
> I was thinking whether this zero cost change just exposed
> an existing problem, then this kind of check should be for all 
> cases not only for zero cost use, similar to 
> expression_expensive_p?  But why doesn't it happen before?
> Need more investigation.


We should think about possible ways of encoding doloop at IVOPTs
time rather than trying to re-analyze at RTL.  I suppose we can
easily set a flag in struct loop but I'm not familiar enough with
the doloop pass to tell whether that is good enough.  Also we
can maybe move doloop to GIMPLE completely?  I remember some
targets have loop size constraints here as well, so I guess
that isn't easily possible.

Richard.

> > 
> >> Btw, this is for GCC10.
> > 
> > *Phew* ;-)
> > 
> > 
> > Some trivial comments:
> > 
> >> +static bool
> >> +invalid_insn_for_doloop_p (struct loop *loop)
> >> +{
> >> +  basic_block *body = get_loop_body (loop);
> >> +  unsigned num_nodes = loop->num_nodes;
> >> +  gimple_stmt_iterator gsi;
> >> +  unsigned i;
> >> +
> >> +  for (i = 0; i < num_nodes; i++)
> > 
> >   for (unsigned i = 0; i < num_nodes; i++)
> > 
> > (and maybe you can just say loop->num_nodes here; I don't know if that
> > generates worse code, or if that even matters).
> 
> Good idea, will fix it.
> 
> > 
> >> +      if (dump_file && (dump_flags & TDF_DETAILS))
> >> +        fprintf (
> >> +          dump_file,
> >> +          "predict doloop failure due to finding computed jump.\n");
> > 
> > We don't normally end lines in (.  There are other solutions to why you
> > did that here; you can use string pasting, to break the string into two,
> > or factor out (some part of) the loop body to reduce indentation.
> > 
> 
> Will adjust it.
> 
> > 
> > Segher
> > 
> 
> 

-- 
Richard Biener <[email protected]>
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)

Re: [PATCH, RFC, rs6000] PR80791 Consider doloop in ivopts

Reply via email to