On Wed, 17 Aug 2022, Aldy Hernandez wrote:
> On Wed, Aug 17, 2022 at 9:54 AM Richard Biener <[email protected]> wrote:
> >
> > On Wed, 17 Aug 2022, Aldy Hernandez wrote:
> >
> > > I just have a few high level comments.
> > >
> > > On Tue, Aug 16, 2022 at 4:05 PM Richard Biener <[email protected]> wrote:
> > > >
> > > > The following refactors profitable_path_p in the backward threader,
> > > > splitting out parts that can be computed once the exit block is known,
> > > > parts that contiguously update and that can be checked allowing
> > > > for the path to be later identified as FSM with larger limits,
> > > > possibly_profitable_path_p, and final checks done when the whole
> > > > path is known, profitable_path_p.
> > >
> > > I thought we were removing references to FSM, as they were leftovers
> > > from some previous incarnation. For that matter, I don't think I ever
> > > understood what they are, so if we're gonna keep them, could you
> > > comment what makes FSM threads different from other threads?
> >
> > I don't know exactly what 'FSM' stands for but the FSM threader
> > specifically tried to cover
>
> It probably stands for finite state machine then?
>
> >
> > for (;;)
> > {
> > switch (state)
> > {
> > case 1:
> > /* sth */
> > state = 3;
> > break;
> > ...
> > case 3:
> > ...
> > }
> > }
> >
> > so optimizing state machine transitions. That is, these are all
> > multiway (switch or goto, but goto support has been removed with the
> > ranger rewrite it seems) and the thread path going through the
>
> ISTR going through the computed goto path in the old code, and it
> never got triggered. I just didn't get around to removing the
> references to GIMPLE_GOTO. At least, the old threader not once got a
> thread we didn't get with the new code, even through a full Fedora
> build. I could be wrong though, but that's my recollection.
When one massages the threader to even consider gotos we eventually
run into find_taken_edge not handling them because range_of_expr
computes the range of 'state' as
[irange] void * [1, +INF]$4 = void
rather than &&E.
A testcase would be
unsigned g;
void FSM (int start)
{
void *states[] = { &&A, &&B, &&C, &&E };
void *state = states[start];
do {
goto *state;
A:
g += 1;
state = g & 1 ? &&B : &&E;
continue;
B:
g += 2;
state = &&C;
continue;
C:
g += 3;
state = states[g & 3];
continue;
E:
break;
} while (1);
}
where we'd expect to thread the B -> C state transition. I checked
gcc 7 and it doesn't do that - not sure if it broke somewhen
on the way or if it was just never working. I'll file a bugreport,
OTOH &label and goto *p are both GNU extensions, so not sure if it
is worth optimizing. I'll attach the "simple" enabler I have.
> > loop backedge. This is why FSM has different size limits because
> > it was thought those threadings are especially worthwhile and
> > they tend to be larger. To make discovery cheaper a TODO item
> > would be to skip to the loop backedge when we reach the regular
> > thread limit and only continue the real search there (and
> > pick the "smallest" ways through any diamonds on the way).
>
> Ah, thanks. If you could, add some similar blurb, it'd be great. I'm
> sure we'll all forget it at some time (well, I will :)).
Sure ;)
Richard.
> Aldy
>
> >
> > > In your possibly_profitable_path_p function, could you document a bit
> > > better what's the difference between profitable_path_p and
> > > possibly_profitable_path_p?
> >
> > Sure.
> >
> > > >
> > > > I've removed the back_threader_profitability instance from the
> > > > back_threader class and instead instantiate it once per path
> > > > discovery. I've kept the size compute non-incremental to simplify
> > > > the patch and not worry about unwinding.
> > > >
> > > > There's key changes to previous behavior - namely we apply
> > > > the param_max_jump_thread_duplication_stmts early only when
> > > > we know the path cannot become an FSM one (multiway + thread through
> > > > latch) but make sure to elide the path query when we we didn't
> > > > yet discover that but are over this limit. Similarly the
> > > > speed limit is now used even when we did not yet discover a
> > > > hot BB on the path. Basically the idea is to only stop path
> > > > discovery when we know the path will never become profitable
> > > > but avoid the expensive path range query when we know it's
> > > > currently not.
> > > >
> > > > I've done a few cleanups, merging functions, on the way.
> > > >
> > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > >
> > > > Statistics show an overall slight increase in threading but
> > > > looking at different files theres noise up and down. That's
> > > > somewhat expected since we now are applying the "more correct"
> > > > limits in the end. Unless I made big mistakes of course.
> > > >
> > > > The next thing cost-wise would be to raise the backwards
> > > > threading limit to the limit of DOM so we don't get
> > > > artificial high counts for that.
> > >
> > > The DOM threader has limits? I thought most of those limits were just
> > > due to the fact that it couldn't determine long enough paths? Either
> > > way, I like that we're merging the necessary forward threader bits
> > > here, in preparation for its demise ;-).
> >
> > Both use param_max_jump_thread_duplication_stmts, but the backwards
> > threader applies this limit to non-FSM threads with
> >
> > /* The generic copier used by the backthreader does not re-use an
> > existing threading path to reduce code duplication. So for that
> > case, drastically reduce the number of statements we are allowed
> > to copy. */
> > if (!(threaded_through_latch && threaded_multiway_branch)
> > && (n_insns * param_fsm_scale_path_stmts
> > >= param_max_jump_thread_duplication_stmts))
> >
> > and param_fsm_scale_path_stmts happens to be two. I have no idea
> > why we apply this scaling, the scaling is otherwise used in
> >
> > /* We avoid creating irreducible inner loops unless we thread through
> > a multiway branch, in which case we have deemed it worth losing
> > other loop optimizations later.
> >
> > We also consider it worth creating an irreducible inner loop if
> > the number of copied statement is low relative to the length of
> > the path -- in that case there's little the traditional loop
> > optimizer would have done anyway, so an irreducible loop is not
> > so bad. */
> > if (!threaded_multiway_branch
> > && creates_irreducible_loop
> > && *creates_irreducible_loop
> > && (n_insns * (unsigned) param_fsm_scale_path_stmts
> > > (m_path.length () *
> > (unsigned) param_fsm_scale_path_blocks)))
> >
> > so my idea is to drop the scaling from applying the
> > param_max_jump_thread_duplication_stmts limit which raises the
> > effective default limit from 15 / 2 to 15, just like what DOM
> > applies (where DOM also estimates some optimization on the path,
> > reducing the stmt count).
> >
> > > Looks good.
> >
> > Thanks - I'll adjust the commentary and push.
> >
> > Richard.
> >
>
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)