On Fri, 26 Apr 2019, Kewen.Lin wrote: > Hi Segher, > > Thanks a lot for your comments! > > on 2019/4/25 下午8:16, Segher Boessenkool wrote: > > > Does it create worse code now? What we have before your patch isn't > > so super either (it has an sldi in the loop, it has two mtctr too). > > Maybe you can show the generated code? > > It's a good question! From the generated codes for the core loop, the > code after my patch doesn't have bdnz to leverage hardware CTR, it has > extra cmpld and branch instead, looks worse. But I wrote a tiny case > to invoke the foo and evaluated the running time, they are equal. > > * Measured time: > After: > real 199.47 > user 198.35 > sys 1.11 > Before: > real 199.19 > user 198.56 > sys 0.62 > > * Command I used to compile: > ${local_gcc} ${case_src}/20050830-1.c -fno-diagnostics-show-caret > -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 > -ffat-lto-objects -fno-ident -c > > * main file (main.c): > extern void foo(); > #define LOOPNUM 10000 > #define N 1000000*256 > int a[N]; > int main() { > for(int i = 0; i < LOOPNUM; i++) { > foo(N); > } > } > > > * Generated code sequence: > > Before my patch: > > cmpwi 0,3,511 > blelr 0 > addi 10,3,-256 > addi 9,3,-512 > addis 8,2,.LC0@toc@ha > ld 8,.LC0@toc@l(8) > cmpwi 0,10,256 > rldicl 9,9,56,40 > sldi 3,3,2 > addi 9,9,1 > mtctr 9 > add 8,3,8 > li 10,42 > blt 0,.L7 # jump to L7 it's less than 256 > .L3: # core loop > stw 10,0(8) > addi 8,8,-1024 > bdnz .L3 > blr > .L7: > li 9,1 # special case, iteration only 1 > mtctr 9 > b .L3 > > After my patch: > > cmpwi 0,3,511 > blelr 0 > addis 7,2,.LC0@toc@ha > ld 7,.LC0@toc@l(7) > addi 10,3,-512 > sldi 9,3,2 > rlwinm 10,10,0,0,23 > li 8,42 > subf 10,10,3 > sldi 10,10,2 > addi 6,7,-1024 > add 9,9,7 > add 10,10,6 > .p2align 4,,15 > .L3: # core loop > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp > beqlr 0 # if eq, return > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp again > bne 0,.L3 # if ne, jump to L3. > blr > > ------------------------ > > I practiced whether we can adjust the decision made in ivopts. > For one compare iv use with zero cost, if one iv cand whose base > is from memory has costly elim_cost before adjust_setup_cost, > it's possible to make later analysis unable to find it's finite, > then we try to avoid it. > > The diff looks like: > > --- a/gcc/tree-ssa-loop-ivopts.c > +++ b/gcc/tree-ssa-loop-ivopts.c > @@ -5126,6 +5126,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > tree *control_var, *bound_cst; > enum tree_code comp = ERROR_MARK; > struct iv_use *use = group->vuses[0]; > + bool dont_elim_p = false; > > /* Extract condition operands. */ > rewrite_type = extract_cond_operands (data, use->stmt, &control_var, > @@ -5152,6 +5153,16 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > inv_expr_elim = get_loop_invariant_expr (data, bound); > bitmap_clear (inv_vars_elim); > } > + > + /* zero cost use makes it easier to select memory based iv cand > + for replacement of non memory based iv and its use. But if > + the setup sequence are too costly, loop iv analysis can NOT > + easily figure out it's finite, it's possible to stop the > + low-overhead loop transformation and get unexpected code. */ > + if (use->zero_cost_p && cand->iv->base_object && !use->iv->base_object > + && elim_cost.cost >= 30) > + dont_elim_p = true; > + > /* The bound is a loop invariant, so it will be only computed > once. */ > elim_cost.cost = adjust_setup_cost (data, elim_cost.cost); > @@ -5184,7 +5195,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > express_cost += bound_cost; > > /* Choose the better approach, preferring the eliminated IV. */ > - if (elim_cost <= express_cost) > + if (elim_cost <= express_cost && !dont_elim_p) > { > > > I was thinking whether this zero cost change just exposed > an existing problem, then this kind of check should be for all > cases not only for zero cost use, similar to > expression_expensive_p? But why doesn't it happen before? > Need more investigation.
We should think about possible ways of encoding doloop at IVOPTs time rather than trying to re-analyze at RTL. I suppose we can easily set a flag in struct loop but I'm not familiar enough with the doloop pass to tell whether that is good enough. Also we can maybe move doloop to GIMPLE completely? I remember some targets have loop size constraints here as well, so I guess that isn't easily possible. Richard. > > > >> Btw, this is for GCC10. > > > > *Phew* ;-) > > > > > > Some trivial comments: > > > >> +static bool > >> +invalid_insn_for_doloop_p (struct loop *loop) > >> +{ > >> + basic_block *body = get_loop_body (loop); > >> + unsigned num_nodes = loop->num_nodes; > >> + gimple_stmt_iterator gsi; > >> + unsigned i; > >> + > >> + for (i = 0; i < num_nodes; i++) > > > > for (unsigned i = 0; i < num_nodes; i++) > > > > (and maybe you can just say loop->num_nodes here; I don't know if that > > generates worse code, or if that even matters). > > Good idea, will fix it. > > > > >> + if (dump_file && (dump_flags & TDF_DETAILS)) > >> + fprintf ( > >> + dump_file, > >> + "predict doloop failure due to finding computed jump.\n"); > > > > We don't normally end lines in (. There are other solutions to why you > > did that here; you can use string pasting, to break the string into two, > > or factor out (some part of) the loop body to reduce indentation. > > > > Will adjust it. > > > > > Segher > > > > -- Richard Biener <rguent...@suse.de> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)