On Fri, 26 Apr 2019, Kewen.Lin wrote:
> Hi Segher,
>
> Thanks a lot for your comments!
>
> on 2019/4/25 下午8:16, Segher Boessenkool wrote:
>
> > Does it create worse code now? What we have before your patch isn't
> > so super either (it has an sldi in the loop, it has two mtctr too).
> > Maybe you can show the generated code?
>
> It's a good question! From the generated codes for the core loop, the
> code after my patch doesn't have bdnz to leverage hardware CTR, it has
> extra cmpld and branch instead, looks worse. But I wrote a tiny case
> to invoke the foo and evaluated the running time, they are equal.
>
> * Measured time:
> After:
> real 199.47
> user 198.35
> sys 1.11
> Before:
> real 199.19
> user 198.56
> sys 0.62
>
> * Command I used to compile:
> ${local_gcc} ${case_src}/20050830-1.c -fno-diagnostics-show-caret
> -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2
> -ffat-lto-objects -fno-ident -c
>
> * main file (main.c):
> extern void foo();
> #define LOOPNUM 10000
> #define N 1000000*256
> int a[N];
> int main() {
> for(int i = 0; i < LOOPNUM; i++) {
> foo(N);
> }
> }
>
>
> * Generated code sequence:
>
> Before my patch:
>
> cmpwi 0,3,511
> blelr 0
> addi 10,3,-256
> addi 9,3,-512
> addis 8,2,.LC0@toc@ha
> ld 8,.LC0@toc@l(8)
> cmpwi 0,10,256
> rldicl 9,9,56,40
> sldi 3,3,2
> addi 9,9,1
> mtctr 9
> add 8,3,8
> li 10,42
> blt 0,.L7 # jump to L7 it's less than 256
> .L3: # core loop
> stw 10,0(8)
> addi 8,8,-1024
> bdnz .L3
> blr
> .L7:
> li 9,1 # special case, iteration only 1
> mtctr 9
> b .L3
>
> After my patch:
>
> cmpwi 0,3,511
> blelr 0
> addis 7,2,.LC0@toc@ha
> ld 7,.LC0@toc@l(7)
> addi 10,3,-512
> sldi 9,3,2
> rlwinm 10,10,0,0,23
> li 8,42
> subf 10,10,3
> sldi 10,10,2
> addi 6,7,-1024
> add 9,9,7
> add 10,10,6
> .p2align 4,,15
> .L3: # core loop
> stw 8,0(9)
> addi 9,9,-1024
> cmpld 0,9,10 # cmp
> beqlr 0 # if eq, return
> stw 8,0(9)
> addi 9,9,-1024
> cmpld 0,9,10 # cmp again
> bne 0,.L3 # if ne, jump to L3.
> blr
>
> ------------------------
>
> I practiced whether we can adjust the decision made in ivopts.
> For one compare iv use with zero cost, if one iv cand whose base
> is from memory has costly elim_cost before adjust_setup_cost,
> it's possible to make later analysis unable to find it's finite,
> then we try to avoid it.
>
> The diff looks like:
>
> --- a/gcc/tree-ssa-loop-ivopts.c
> +++ b/gcc/tree-ssa-loop-ivopts.c
> @@ -5126,6 +5126,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data,
> tree *control_var, *bound_cst;
> enum tree_code comp = ERROR_MARK;
> struct iv_use *use = group->vuses[0];
> + bool dont_elim_p = false;
>
> /* Extract condition operands. */
> rewrite_type = extract_cond_operands (data, use->stmt, &control_var,
> @@ -5152,6 +5153,16 @@ determine_group_iv_cost_cond (struct ivopts_data *data,
> inv_expr_elim = get_loop_invariant_expr (data, bound);
> bitmap_clear (inv_vars_elim);
> }
> +
> + /* zero cost use makes it easier to select memory based iv cand
> + for replacement of non memory based iv and its use. But if
> + the setup sequence are too costly, loop iv analysis can NOT
> + easily figure out it's finite, it's possible to stop the
> + low-overhead loop transformation and get unexpected code. */
> + if (use->zero_cost_p && cand->iv->base_object && !use->iv->base_object
> + && elim_cost.cost >= 30)
> + dont_elim_p = true;
> +
> /* The bound is a loop invariant, so it will be only computed
> once. */
> elim_cost.cost = adjust_setup_cost (data, elim_cost.cost);
> @@ -5184,7 +5195,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data,
> express_cost += bound_cost;
>
> /* Choose the better approach, preferring the eliminated IV. */
> - if (elim_cost <= express_cost)
> + if (elim_cost <= express_cost && !dont_elim_p)
> {
>
>
> I was thinking whether this zero cost change just exposed
> an existing problem, then this kind of check should be for all
> cases not only for zero cost use, similar to
> expression_expensive_p? But why doesn't it happen before?
> Need more investigation.
We should think about possible ways of encoding doloop at IVOPTs
time rather than trying to re-analyze at RTL. I suppose we can
easily set a flag in struct loop but I'm not familiar enough with
the doloop pass to tell whether that is good enough. Also we
can maybe move doloop to GIMPLE completely? I remember some
targets have loop size constraints here as well, so I guess
that isn't easily possible.
Richard.
> >
> >> Btw, this is for GCC10.
> >
> > *Phew* ;-)
> >
> >
> > Some trivial comments:
> >
> >> +static bool
> >> +invalid_insn_for_doloop_p (struct loop *loop)
> >> +{
> >> + basic_block *body = get_loop_body (loop);
> >> + unsigned num_nodes = loop->num_nodes;
> >> + gimple_stmt_iterator gsi;
> >> + unsigned i;
> >> +
> >> + for (i = 0; i < num_nodes; i++)
> >
> > for (unsigned i = 0; i < num_nodes; i++)
> >
> > (and maybe you can just say loop->num_nodes here; I don't know if that
> > generates worse code, or if that even matters).
>
> Good idea, will fix it.
>
> >
> >> + if (dump_file && (dump_flags & TDF_DETAILS))
> >> + fprintf (
> >> + dump_file,
> >> + "predict doloop failure due to finding computed jump.\n");
> >
> > We don't normally end lines in (. There are other solutions to why you
> > did that here; you can use string pasting, to break the string into two,
> > or factor out (some part of) the loop body to reduce indentation.
> >
>
> Will adjust it.
>
> >
> > Segher
> >
>
>
--
Richard Biener <[email protected]>
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)