On Tue, Nov 23, 2021 at 4:20 PM Martin Liška <mli...@suse.cz> wrote:
>
> On 11/23/21 14:58, Richard Biener wrote:
> > On Mon, Nov 22, 2021 at 4:07 PM Martin Liška <mli...@suse.cz> wrote:
> >>
> >> On 11/19/21 11:00, Richard Biener wrote:
> >>> On Tue, Nov 16, 2021 at 3:40 PM Martin Liška <mli...@suse.cz> wrote:
> >>>>
> >>>> On 11/11/21 08:15, Richard Biener wrote:
> >>>>> So I'd try to do no functional change first, improving the costing and
> >>>>> setting up the transform to simply pick up the stmts to "fold" as 
> >>>>> discovered
> >>>>> during analysis (as I hinted you possibly can use gimple_uid to mark
> >>>>> the stmts that simplify, IIRC gimple_uid is preserved during copying.
> >>>>> gimple_uid would also scale better than gimple_plf in case we do
> >>>>> the analysis for all candidates at once).
> >>>>
> >>>> Thinking about the analysis. Am I correct that we want to properly 
> >>>> calculate
> >>>> loop size for true and false edge of a potential gcond before the 
> >>>> actually unswitching?
> >>>
> >>> Yes.
> >>>
> >>>> We can do that by finding a first gcond candidate, evaluate (symbolic + 
> >>>> irange approache)
> >>>> all other gcond in the loop body and use BB_REACHABLE discovery. 
> >>>> Similarly to what we do now
> >>>> at lines 378-446. Then tree_num_loop_insns can be adjusted for only 
> >>>> these reachable blocks.
> >>>> Having that, we can calculate # of insns that will live in true/false 
> >>>> loops.
> >>>
> >>> So whatever we do here we should record as "this control stmt folds to
> >>> {true,false}" (or {true,unknown},
> >>> or in future, "this control stmt will lead to edge {e,unknown}"),
> >>> recording the simplification
> >>> on the true/false loop version in a way we can apply it after the 
> >>> transform.
> >>>
> >>>> Then we can call tree_unswitch_loop and make the gcond folding as we do 
> >>>> in the versioned loops.
> >>>>
> >>>> Is it a step in good direction? Having that we can then extend it to 
> >>>> gswitch statements.
> >>>
> >>> One issue I see is that BB_REACHABLE is there only once but you could use
> >>> auto_bb_flag reachable_true, reachable_false to distinguish the
> >>> true/false loop version
> >>> copies.
> >>>
> >>> So yes, I think that sounds reasonable.  At the point we want to
> >>> evaluate different
> >>> (first) unswitching opportunities against each other storing this only
> >>> as BB flag is
> >>> likely to hit limits.  When we want to evaluate multiple levels of
> >>> unswitching before
> >>> doing any transforms even more so (if there are 3 opportunities there'd be
> >>> many cases to be considered when going to level 3 ;)).  I _think_ that a 
> >>> sparse
> >>> lattice of stmt UID -> edge might do the trick if we change 
> >>> tree_num_loop_insns
> >>> do to a DFS walk from the loop header, ignoring not taken edges by
> >>> consulting the
> >>> lattice.  Or, for speed reason, pre-compute tree_num_loop_insns for each 
> >>> BB
> >>> so we just have to sum a different set of BBs rather than walking all
> >>> stmts again.
> >>>
> >>> That said, the second step would definitely be to choose the "best" 
> >>> opportunity
> >>> on the current level.
> >>>
> >>> Richard.
> >>>
> >>>> Cheers,
> >>>> Martin
> >>
> >> Hello.
> >>
> >> I'm sending a new version where I changed:
> >> 1) all unswitch_predicates are find for a loop
> >> 2) context sensitive costing happens based on an unswitch_predicate and BB 
> >> reachability
> >>      is implemented
> >> 3) folding happens in recursive invocation once we decide to unswitch
> >> 4) the patch folds both symbolic gcond predicates and irange provided by 
> >> ranger
> >> 5) debug counter was added
> >>
> >> Patch can bootstrap on x86_64-linux-gnu and survives regression tests. 
> >> Plus, I tested it
> >> on SPEC2006 and SPEC2017 with -size=ref.
> >
> > Meh, diff made a mess out if this ;)  Random comments, I'm walking
> > myself the optimizations
> > flow.
>
> Sure.
>
> >
> > tree_unswitch_single_loop:
> >
> > +  unswitch_predicate *predicate = NULL;
> > +  if (num > param_max_unswitch_level)
> > +    {
> > +      if (dump_file
> > +         && (dump_flags & TDF_DETAILS))
> > +       fprintf (dump_file, ";; Not unswitching anymore, hit max level\n");
> > +      goto exit;
> > +    }
> >
> > this looks like we can do this check before find_all_unswitching_predicates?
>
> Makes sense.
>
> >
> > +  for (auto pred: candidates)
> > +    {
> > +      unsigned cost
> > +       = evaluate_loop_insns_for_predicate (loop, bbs, ranger, pred);
> > ...
> >
> > so this searches for the first candidate that fits in
> > param_max_unswitch_insns, it doesn't
> > yet try to find the cheapest / best one.  Please add a comment to say
> > that.  After we
> > found one candidate we apply unswitching to such one candidate (and throw 
> > the
> > others away).  I guess that's OK - it's what the old code did - what
> > you did for this
> > intermediate step is actually gather all unswitching predicates
> > upfront.  Hopefully
> > we'll be able to share some of the work done for the recursive invocations.
> >
> > +         fprintf (dump_file, ";; Unswitching loop with condition: ");
> >
> > "on condition"
> >
> > +         fprintf (dump_file, ";; Not unswitching condition, loop too big "
> > +                  "(%d insns): ", cost);
> >
> > "cost too big"?  I assume 'cost' is the number of stmts we'll add,
> > loop-size - true-eliminated - false-eliminated?
>
> I'm going to adjust this.
>
> >
> > +exit:
> > +  for (auto predicate: candidates)
> > +    delete predicate;
> >
> > Some refactoring should get rid of the goto ...
>
> You don't like it? It seems to me quite logical as one does not have to repeat
> a clean up code before each return statement.
>
> >
> > +static unsigned
> > +evaluate_insns (class loop *loop,  basic_block *bbs,
> > +               unswitch_predicate *candidate, bool true_edge,
> > +               gimple_ranger *ranger, auto_bb_flag &reachable_flag)
> > +{
> > ...
> > +                 unswitch_predicate *predicate
> > +                   = tree_may_unswitch_on (bb, loop, ranger);
> > +                 if (predicate != NULL)
> > +                   {
> > +                     tree folded
> > +                       = simplify_using_entry_checks (predicate, candidate,
> > +                                                      true_edge);
> >
> > so just looking at the implementation it seems that if you assign UIDs
> > to all last_stmt() in a loop you can keep tree_may_unswitch_on ()
> > in a dense array and you do not have to re-compute it?  UIDs should
> > survive copying so even for recursive unswitchings the predicates should
> > be shareable (pointing to the wrong stmt then, but then just do not record
> > the stmt in there - it should be available from the caller somehow).
>
> I like the idea, lemme implement it.
>
> >
> > But then I wonder how this translates to gswitch handling where the
> > result is a taken edge / a set of known not taken edges.  I suggest
> > to try wiring in gcond support in a separate commit on top of this to
>
> s/gcond/gswitch, right?
>
> > see whether some refactoring is needed.
>
> Sure, so for e.g. case 1 ... 5 we would need to create a new 
> unswitch_predicate
> with 1 <= index && index <= 5 tree predicate (and the corresponding irange 
> range).
> Later once we unswitch on it, we should use a special unreachable_flag that 
> will
> be used for marking of dead edges (similarly how we fold gconds to 
> boolean_{false/true}_node.
> Does it make sense?
>
> >
> > Looking at evaluate_insns it also seems that 'reachable_flag'
> > is unused - computing the sizes could be done once you process
>
> It is used:
>
>           if (dest->loop_father == loop
>               && !(dest->flags & reachable_flag)
>               && !(e->flags & flags))
>             {
>               worklist.safe_push (dest);
>               dest->flags |= reachable_flag;
>             }
>
> Where's problem?

Ah, OK - it's used as 'visited' as well.  Fine then.

>
> > a block, no?  Previously I suggested to compute estimate_num_insns
> > for each BB of the original loop once so you can re-use that and
> > just need to sum up BB costs.
>
> Yep, makes sense.

I could suggest to use bb->aux ...

(maybe that's even copied by the versioning so that recursions do not need
to recompute stuff - no, it doesn't seem so, we only copy edge->aux it seems
looking at duplicate_block!?)

Anyway - just compute it for each "original" loop then, in the very far end
we hopefully evaluate the whole recursion before doing the transforms.

> >
> >
> > +static bool
> > +find_all_unswitching_predicates (class loop *loop, basic_block *bbs,
> > +                                bool true_edge,
> > +                                unswitch_predicate *parent_predicate,
> > +                                gimple_ranger *ranger,
> > +                                vec<unswitch_predicate *> &candidates)
> > +{
> > ...
> >
> > so you re-do the simplification step here, in fact this function seems
> > to not only find unswitching predicates but apply simplifications
> > from the previous unswitching step.  (yes, that's what the old code did ...)
> > The costing step already did all the work though, it should be
> > possible to keep a vector of stmts to simplify and the simplification
> > from that analysis time.
>
> Makes sense.
>
> >
> > You can get at the "other loop copy" stmt via
> >
> >    last_stmt (get_bb_copy (gimple_bb (cond)))
> >
> > if you do that before free_original_copy_tables.  I think it would be
> > nice to have the simplification step explicit before recursing
> > even if that means some duplicate walking for now
> > (I do hope we can share the find_all_unswitching_predicates
> > analysis step)
>
> Nods.
>
> >
> > You are applying param_max_unswitch_insns with
> >
> > +      unsigned cost
> > +       = evaluate_loop_insns_for_predicate (loop, bbs, ranger, pred);
> >
> > -      cond = simplify_using_entry_checks (loop, cond);
> > -      stmt = last_stmt (bbs[i]);
> > -      if (integer_nonzerop (cond))
> > +      if (cost <= (unsigned)param_max_unswitch_insns)
> >
> > where cost is the sum of the simplified sizes of both loop copies after
> > unswitching.  By its very nature this will match for all recursive 
> > invocations
> > (loops only will get smaller) which is why we need param_max_unswitch_level.
> > The idea was to have one param_max_unswitch_insns budget for each
> > _original_ loop the recursive invocations will eat from, removing the need
> > for param_max_unswitch_level.  I'd re-interpret param_max_unswitch_insns
> > as the number of insns we may add, so with cost calculated as above we'd
> > have
> >
> >      if (cost <= budget)
> > ...
> >
> > and
> >
> >      budget -= cost;
> >
> > before the recursion (and pass by reference so that the two recursions
> > share the budget).
> >
> > Initialize by
> >
> >    budget = original_loop_size + param_max_unswitch_insns;
>
> Ah, ok, lemme change that.
>
> >
> > Hope this wasn't too many comments ;)
>
> It was, but they are all very useful.
>
> Thanks,
> Martin
>
> >
> > Thanks,
> > Richard.
> >
> >> Thoughts?
> >> Thanks,
> >> Martin
> >>
>

Reply via email to