[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

rguenth at gcc dot gnu.org via Gcc-bugs Mon, 28 Sep 2020 06:28:44 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789


--- Comment #33 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #32)
> (In reply to Richard Biener from comment #31)
> > (In reply to Kewen Lin from comment #29)
> > > (In reply to Hongtao.liu from comment #28)
> > > > > Probably you can try to tweak it in ix86_add_stmt_cost? when the 
> > > > > statement
> > > > 
> > > > Yes, it's the place.
> > > > 
> > > > > is UB to UH conversion statement, further check if the def of the 
> > > > > input UB
> > > > > is MEM.
> > > > 
> > > > Only if there's no multi-use for UB. More generally, it's quite 
> > > > difficult to
> > > > guess later optimizations for the purpose of more accurate vectorization
> > > > cost model, :(.
> > > 
> > > Yeah, it's hard sadly. The generic cost modeling is rough,
> > > ix86_add_stmt_cost is more fine-grain (at least than what we have on Power
> > > :)), if you want to check it more, it seems doable in target specific hook
> > > finish_cost where you can get the whole vinfo object, but it could end up
> > > with very heavy analysis and might not be worthy.
> > > 
> > > Do you mind to check if it can also fix this degradation on x86 to run FRE
> > > and DSE just after cunroll? I found it worked for Power, hoped it can help
> > > there too.
> > 
> > Btw, we could try sth like adding a TODO_force_next_scalar_cleanup to be
> > returned from passes that see cleanup opportunities and have the pass
> > manager queue that up, looking for a special marked pass and enabling
> > that so we could have
> > 
> >           NEXT_PASS (pass_predcom);
> >           NEXT_PASS (pass_complete_unroll);
> >           NEXT_PASS (pass_scalar_cleanup);
> >           PUSH_INSERT_PASSES_WITHIN (pass_scalar_cleanup);
> >             NEXT_PASS (pass_fre, false /* may_iterate */);
> >             NEXT_PASS (pass_dse);
> >           POP_INSERT_PASSES ();
> > 
> > with pass_scalar_cleanup gate() returning false otherwise.  Eventually
> > pass properties would match this better, or sth else.
> > 
> 
> Thanks for the suggestion! Before cooking the patch, I have one question
> that it looks to only update function property is enough, eg: some pass sets
> property PROP_ok_for_cleanup and later pass_scalar_cleanup only goes for the
> func with this property (checking in gate), I'm not quite sure the reason
> for the TODO_flag TODO_force_next_scalar_cleanup.

properties are not an easy fit since they are static in the pass
description while we want to trigger the cleanup only if we unrolled
an outermost loop for example.  Returning TODO_force_next_scalar_cleanup
from cunroll is the natural way to signal this.

The hackish way then would be to "queue" this TODO_force_next_scalar_cleanup
inside the pass manager itself until it runs into a
"scalar cleanup pass" (either somehow marked or just hard-matched), forcing
its gate() to evaluate to true (just skipping its evaluation).

As said, there's no "nice" way for the information flow at the moment.

One could expose the "pending TODO" (TODO not handled by the pass manager
itself) in a global variable (like current_pass) so the cleanup pass
gate() could check

  gate () { return pending_todo & TODO_force_next_scalar_cleanup; }

and in its execute clear this bit from pending_todo.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

Reply via email to