Re: [PATCH] Refactor back_threader_profitability

Aldy Hernandez via Gcc-patches Wed, 17 Aug 2022 02:27:35 -0700

On Wed, Aug 17, 2022 at 10:59 AM Richard Biener <[email protected]> wrote:
>
> On Wed, 17 Aug 2022, Aldy Hernandez wrote:
>
> > On Wed, Aug 17, 2022 at 10:38 AM Richard Biener <[email protected]> wrote:
> > >
> > > On Wed, 17 Aug 2022, Aldy Hernandez wrote:
> > >
> > > > On Wed, Aug 17, 2022 at 9:54 AM Richard Biener <[email protected]> 
> > > > wrote:
> > > > >
> > > > > On Wed, 17 Aug 2022, Aldy Hernandez wrote:
> > > > >
> > > > > > I just have a few high level comments.
> > > > > >
> > > > > > On Tue, Aug 16, 2022 at 4:05 PM Richard Biener <[email protected]> 
> > > > > > wrote:
> > > > > > >
> > > > > > > The following refactors profitable_path_p in the backward 
> > > > > > > threader,
> > > > > > > splitting out parts that can be computed once the exit block is 
> > > > > > > known,
> > > > > > > parts that contiguously update and that can be checked allowing
> > > > > > > for the path to be later identified as FSM with larger limits,
> > > > > > > possibly_profitable_path_p, and final checks done when the whole
> > > > > > > path is known, profitable_path_p.
> > > > > >
> > > > > > I thought we were removing references to FSM, as they were leftovers
> > > > > > from some previous incarnation.  For that matter, I don't think I 
> > > > > > ever
> > > > > > understood what they are, so if we're gonna keep them, could you
> > > > > > comment what makes FSM threads different from other threads?
> > > > >
> > > > > I don't know exactly what 'FSM' stands for but the FSM threader
> > > > > specifically tried to cover
> > > >
> > > > It probably stands for finite state machine then?
> > > >
> > > > >
> > > > > for (;;)
> > > > >  {
> > > > >    switch (state)
> > > > >     {
> > > > >     case 1:
> > > > >       /* sth */
> > > > >       state = 3;
> > > > >       break;
> > > > >     ...
> > > > >     case 3:
> > > > >       ...
> > > > >     }
> > > > > }
> > > > >
> > > > > so optimizing state machine transitions.  That is, these are all
> > > > > multiway (switch or goto, but goto support has been removed with the
> > > > > ranger rewrite it seems) and the thread path going through the
> > > >
> > > > ISTR going through the computed goto path in the old code, and it
> > > > never got triggered.  I just didn't get around to removing the
> > > > references to GIMPLE_GOTO.  At least, the old threader not once got a
> > > > thread we didn't get with the new code, even through a full Fedora
> > > > build.  I could be wrong though, but that's my recollection.
> > >
> > > When one massages the threader to even consider gotos we eventually
> > > run into find_taken_edge not handling them because range_of_expr
> > > computes the range of 'state' as
> > >
> > > [irange] void * [1, +INF]$4 = void
> > >
> > > rather than &&E.
> > >
> > > A testcase would be
> > >
> > > unsigned g;
> > > void FSM (int start)
> > > {
> > >   void *states[] = { &&A, &&B, &&C, &&E };
> > >   void *state = states[start];
> > >
> > >   do {
> > >   goto *state;
> > >
> > > A:
> > >   g += 1;
> > >   state = g & 1 ? &&B : &&E;
> > >   continue;
> > >
> > > B:
> > >   g += 2;
> > >   state = &&C;
> > >   continue;
> > >
> > > C:
> > >   g += 3;
> > >   state = states[g & 3];
> > >   continue;
> > >
> > > E:
> > >   break;
> > >   } while (1);
> > > }
> > >
> > > where we'd expect to thread the B -> C state transition.  I checked
> > > gcc 7 and it doesn't do that - not sure if it broke somewhen
> > > on the way or if it was just never working.  I'll file a bugreport,
> > > OTOH &label and goto *p are both GNU extensions, so not sure if it
> > > is worth optimizing.  I'll attach the "simple" enabler I have.
> >
> > Yeah, that's what I thought.  I seem to remember tracing through the
> > old code and never finding a path that would handle computed gotos.
> > Either way, thanks for distilling a testcase.  If you could file a PR
> > and CC me on it, that'd be great.
> >
> > When I contribute prange (irange for pointers), the plan is to have it
> > handle pointer equivalences, and we should be able to get the address
> > from the prange itself.
>
> Well, ISTR seeing &decl in integer ranges so I wondered why it doesn't
> just work ;)  tracing ranger shows


&decl in integer ranges are leftovers from legacy evrp/VRP.  Iranges
are constant only, but we make an exception when
irange::legacy_mode_p() is true so legacy VRP can fit in whatever it
needs.  But ranger will never put a non-integer in an irange.  There's
probably an assert to make sure a symbolic never slips into a "modern"
irange.

The proof of concept I have for prange still has integer endpoints
(well, zero, nonzero, since that's what we care about).  But range-ops
will also keep track of equivalences that we can query with
prange::get_equiv() or some such.

So yeah, ranger will never give you &decl with irange, but prange may
give us what we need.

Aldy

>
> Checking profitability of path (backwards):  bb:11 (0 insns) bb:4 (latch)
>   Control statement insns: 0
>   Overall: 0 insns
> 1        range_of_expr(state_9) at stmt state_9 = PHI <&E(6), state_23(9),
> &C(8), &B(5)>
> 2          range_of_stmt (state_9) at stmt state_9 = PHI <&E(6),
> state_23(9), &C(8), &B(5)>
> 3            ROS dependence fill
>                ROS dep fill (state_9) at stmt state_9 = PHI <&E(6),
> state_23(9), &C(8), &B(5)>
>              FALSE : (3) ROS  (state_9)
> 4            range_on_edge (&E) on edge 6->4
>              TRUE : (4) range_on_edge (&E) [irange] void * [1, +INF]
>
> so why's the range not &E but non-NULL instead?  Seems to be
> range_query::get_tree_range doing.
>
> Richard.
>
> > Aldy
> >
> > >
> > > > > loop backedge.  This is why FSM has different size limits because
> > > > > it was thought those threadings are especially worthwhile and
> > > > > they tend to be larger.  To make discovery cheaper a TODO item
> > > > > would be to skip to the loop backedge when we reach the regular
> > > > > thread limit and only continue the real search there (and
> > > > > pick the "smallest" ways through any diamonds on the way).
> > > >
> > > > Ah, thanks.  If you could, add some similar blurb, it'd be great.  I'm
> > > > sure we'll all forget it at some time (well, I will :)).
> > >
> > > Sure ;)
> > >
> > > Richard.
> > >
> > > > Aldy
> > > >
> > > > >
> > > > > > In your possibly_profitable_path_p function, could you document a 
> > > > > > bit
> > > > > > better what's the difference between profitable_path_p and
> > > > > > possibly_profitable_path_p?
> > > > >
> > > > > Sure.
> > > > >
> > > > > > >
> > > > > > > I've removed the back_threader_profitability instance from the
> > > > > > > back_threader class and instead instantiate it once per path
> > > > > > > discovery.  I've kept the size compute non-incremental to simplify
> > > > > > > the patch and not worry about unwinding.
> > > > > > >
> > > > > > > There's key changes to previous behavior - namely we apply
> > > > > > > the param_max_jump_thread_duplication_stmts early only when
> > > > > > > we know the path cannot become an FSM one (multiway + thread 
> > > > > > > through
> > > > > > > latch) but make sure to elide the path query when we we didn't
> > > > > > > yet discover that but are over this limit.  Similarly the
> > > > > > > speed limit is now used even when we did not yet discover a
> > > > > > > hot BB on the path.  Basically the idea is to only stop path
> > > > > > > discovery when we know the path will never become profitable
> > > > > > > but avoid the expensive path range query when we know it's
> > > > > > > currently not.
> > > > > > >
> > > > > > > I've done a few cleanups, merging functions, on the way.
> > > > > > >
> > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > > > >
> > > > > > > Statistics show an overall slight increase in threading but
> > > > > > > looking at different files theres noise up and down.  That's
> > > > > > > somewhat expected since we now are applying the "more correct"
> > > > > > > limits in the end.  Unless I made big mistakes of course.
> > > > > > >
> > > > > > > The next thing cost-wise would be to raise the backwards
> > > > > > > threading limit to the limit of DOM so we don't get
> > > > > > > artificial high counts for that.
> > > > > >
> > > > > > The DOM threader has limits?  I thought most of those limits were 
> > > > > > just
> > > > > > due to the fact that it couldn't determine long enough paths?  
> > > > > > Either
> > > > > > way, I like that we're merging the necessary forward threader bits
> > > > > > here, in preparation for its demise ;-).
> > > > >
> > > > > Both use param_max_jump_thread_duplication_stmts, but the backwards
> > > > > threader applies this limit to non-FSM threads with
> > > > >
> > > > >   /* The generic copier used by the backthreader does not re-use an
> > > > >      existing threading path to reduce code duplication.  So for that
> > > > >      case, drastically reduce the number of statements we are allowed
> > > > >      to copy.  */
> > > > >   if (!(threaded_through_latch && threaded_multiway_branch)
> > > > >       && (n_insns * param_fsm_scale_path_stmts
> > > > >           >= param_max_jump_thread_duplication_stmts))
> > > > >
> > > > > and param_fsm_scale_path_stmts happens to be two.  I have no idea
> > > > > why we apply this scaling, the scaling is otherwise used in
> > > > >
> > > > >   /* We avoid creating irreducible inner loops unless we thread 
> > > > > through
> > > > >      a multiway branch, in which case we have deemed it worth losing
> > > > >      other loop optimizations later.
> > > > >
> > > > >      We also consider it worth creating an irreducible inner loop if
> > > > >      the number of copied statement is low relative to the length of
> > > > >      the path -- in that case there's little the traditional loop
> > > > >      optimizer would have done anyway, so an irreducible loop is not
> > > > >      so bad.  */
> > > > >   if (!threaded_multiway_branch
> > > > >       && creates_irreducible_loop
> > > > >       && *creates_irreducible_loop
> > > > >       && (n_insns * (unsigned) param_fsm_scale_path_stmts
> > > > >           > (m_path.length () *
> > > > >              (unsigned) param_fsm_scale_path_blocks)))
> > > > >
> > > > > so my idea is to drop the scaling from applying the
> > > > > param_max_jump_thread_duplication_stmts limit which raises the
> > > > > effective default limit from 15 / 2 to 15, just like what DOM
> > > > > applies (where DOM also estimates some optimization on the path,
> > > > > reducing the stmt count).
> > > > >
> > > > > > Looks good.
> > > > >
> > > > > Thanks - I'll adjust the commentary and push.
> > > > >
> > > > > Richard.
> > > > >
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <[email protected]>
> > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
> > > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
> > > HRB 36809 (AG Nuernberg)
> > >
> >
> >
>
> --
> Richard Biener <[email protected]>
> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
> Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
> HRB 36809 (AG Nuernberg)
>

Re: [PATCH] Refactor back_threader_profitability

Reply via email to