On Fri, Apr 26, 2019 at 2:44 PM Kewen.Lin <li...@linux.ibm.com> wrote: > > Hi Segher, > > Thanks a lot for your comments! > > on 2019/4/25 下午8:16, Segher Boessenkool wrote: > > > Does it create worse code now? What we have before your patch isn't > > so super either (it has an sldi in the loop, it has two mtctr too). > > Maybe you can show the generated code? > > It's a good question! From the generated codes for the core loop, the > code after my patch doesn't have bdnz to leverage hardware CTR, it has > extra cmpld and branch instead, looks worse. But I wrote a tiny case > to invoke the foo and evaluated the running time, they are equal. > > * Measured time: > After: > real 199.47 > user 198.35 > sys 1.11 > Before: > real 199.19 > user 198.56 > sys 0.62 > > * Command I used to compile: > ${local_gcc} ${case_src}/20050830-1.c -fno-diagnostics-show-caret > -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 > -ffat-lto-objects -fno-ident -c > > * main file (main.c): > extern void foo(); > #define LOOPNUM 10000 > #define N 1000000*256 > int a[N]; > int main() { > for(int i = 0; i < LOOPNUM; i++) { > foo(N); > } > } > > > * Generated code sequence: > > Before my patch: > > cmpwi 0,3,511 > blelr 0 > addi 10,3,-256 > addi 9,3,-512 > addis 8,2,.LC0@toc@ha > ld 8,.LC0@toc@l(8) > cmpwi 0,10,256 > rldicl 9,9,56,40 > sldi 3,3,2 > addi 9,9,1 > mtctr 9 > add 8,3,8 > li 10,42 > blt 0,.L7 # jump to L7 it's less than 256 > .L3: # core loop > stw 10,0(8) > addi 8,8,-1024 > bdnz .L3 > blr > .L7: > li 9,1 # special case, iteration only 1 > mtctr 9 > b .L3 > > After my patch: > > cmpwi 0,3,511 > blelr 0 > addis 7,2,.LC0@toc@ha > ld 7,.LC0@toc@l(7) > addi 10,3,-512 > sldi 9,3,2 > rlwinm 10,10,0,0,23 > li 8,42 > subf 10,10,3 > sldi 10,10,2 > addi 6,7,-1024 > add 9,9,7 > add 10,10,6 > .p2align 4,,15 > .L3: # core loop > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp > beqlr 0 # if eq, return > stw 8,0(9) > addi 9,9,-1024 > cmpld 0,9,10 # cmp again > bne 0,.L3 # if ne, jump to L3. > blr > > ------------------------ > > I practiced whether we can adjust the decision made in ivopts. > For one compare iv use with zero cost, if one iv cand whose base > is from memory has costly elim_cost before adjust_setup_cost, > it's possible to make later analysis unable to find it's finite, > then we try to avoid it. > > The diff looks like: > > --- a/gcc/tree-ssa-loop-ivopts.c > +++ b/gcc/tree-ssa-loop-ivopts.c > @@ -5126,6 +5126,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > tree *control_var, *bound_cst; > enum tree_code comp = ERROR_MARK; > struct iv_use *use = group->vuses[0]; > + bool dont_elim_p = false; > > /* Extract condition operands. */ > rewrite_type = extract_cond_operands (data, use->stmt, &control_var, > @@ -5152,6 +5153,16 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > inv_expr_elim = get_loop_invariant_expr (data, bound); > bitmap_clear (inv_vars_elim); > } > + > + /* zero cost use makes it easier to select memory based iv cand > + for replacement of non memory based iv and its use. But if > + the setup sequence are too costly, loop iv analysis can NOT > + easily figure out it's finite, it's possible to stop the > + low-overhead loop transformation and get unexpected code. */ > + if (use->zero_cost_p && cand->iv->base_object && !use->iv->base_object > + && elim_cost.cost >= 30) > + dont_elim_p = true; No, we'd like to avoid such things in general. The conditions look like a hack to me. elim_cost is compared to express_cost, adding another check on it at different place isn't really good, especially 30 itself is a magic number. It's most likely improvement in some cases, deterioration in others.
Also it punishes one pass (IVOPTs here) because of other pass' (RTL) problem. It does't mean we can't do such transformations, only it has to be as precise/conservative as possible. For example, if RTL loop iv is improved to handle the case in the future, who would remember to come back and adjust this? GCC lacks the capability passing information to later passes. Gimple analyzer worked hard collecting various information but discards it entering RTL or earlier. Other examples are like runtime alias information, non-wrapping information for specific operations, etc. IMHO, this is what needs to be done. As for this case, it could be finite loop info, or non-wrapping info of the iv_var's increment operation. By passing more information, RTL passes can be simplified too. Thanks, bin > + > /* The bound is a loop invariant, so it will be only computed > once. */ > elim_cost.cost = adjust_setup_cost (data, elim_cost.cost); > @@ -5184,7 +5195,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, > express_cost += bound_cost; > > /* Choose the better approach, preferring the eliminated IV. */ > - if (elim_cost <= express_cost) > + if (elim_cost <= express_cost && !dont_elim_p) > { > > > I was thinking whether this zero cost change just exposed > an existing problem, then this kind of check should be for all > cases not only for zero cost use, similar to > expression_expensive_p? But why doesn't it happen before? > Need more investigation. > > > > >> Btw, this is for GCC10. > > > > *Phew* ;-) > > > > > > Some trivial comments: > > > >> +static bool > >> +invalid_insn_for_doloop_p (struct loop *loop) > >> +{ > >> + basic_block *body = get_loop_body (loop); > >> + unsigned num_nodes = loop->num_nodes; > >> + gimple_stmt_iterator gsi; > >> + unsigned i; > >> + > >> + for (i = 0; i < num_nodes; i++) > > > > for (unsigned i = 0; i < num_nodes; i++) > > > > (and maybe you can just say loop->num_nodes here; I don't know if that > > generates worse code, or if that even matters). > > Good idea, will fix it. > > > > >> + if (dump_file && (dump_flags & TDF_DETAILS)) > >> + fprintf ( > >> + dump_file, > >> + "predict doloop failure due to finding computed jump.\n"); > > > > We don't normally end lines in (. There are other solutions to why you > > did that here; you can use string pasting, to break the string into two, > > or factor out (some part of) the loop body to reduce indentation. > > > > Will adjust it. > > > > > Segher > > >