gcc-12-20210704 is now available

2021-07-04 Thread GCC Administrator via Gcc
Snapshot gcc-12-20210704 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/12-20210704/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 12 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch master 
revision d07092a61d5a6907b2d92563e810bf5bb8e61c01

You'll find:

 gcc-12-20210704.tar.xz   Complete GCC

  SHA256=98b08d4e3b0706ca642e35f8b57ba72fb3736d5cdf7f2494c8618e30ed743830
  SHA1=b9d3525db30621bb0af9f514e69bca784a28bc3f

Diffs from 12-20210627 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-12
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: Question on tree LIM

2021-07-04 Thread Kewen.Lin via Gcc
on 2021/7/2 7:28 PM, Richard Biener wrote:
> On Fri, Jul 2, 2021 at 11:05 AM Kewen.Lin wrote:
>>
>> Hi Richard,
>>
>> on 2021/7/2 4:07 PM, Richard Biener wrote:
>>> On Fri, Jul 2, 2021 at 5:34 AM Kewen.Lin via Gcc wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am investigating a degradation in SPEC2017 exchange2_r:
>>>> with loop vectorization enabled at -O2, it degrades by 6%.  After some
>>>> isolation, I found it isn't directly caused by vectorization itself
>>>> but is exposed by it: some of the vectorization condition checks
>>>> are hoisted out and increase register pressure, which finally
>>>> results in more spilling than before.  If I simply disable tree
>>>> lim4, the gap becomes smaller (just 40%+ of the original); if I
>>>> further disable rtl lim, it drops to 30% of the original.  This
>>>> seems to indicate there is some room for improvement in both LIMs.
>>>>
>>>> From a quick scan of tree LIM, I noticed that there seems to be no
>>>> consideration of register pressure.  Is that intentional?  I am
>>>> wondering what the design philosophy behind it is.  Is it because
>>>> it's hard to model register pressure well here?  If so, it seems to
>>>> put the burden onto the later RA, which needs to have good
>>>> rematerialization support.
>>>
>>> Yes, it is "intentional" in that doing any kind of prioritization based
>>> on register pressure is hard at the GIMPLE level, since most
>>> high-level transforms try to expose follow-up transforms which you'd
>>> somehow have to anticipate.  Note that LIM's "cost model" (if you can
>>> call it such...) is too simplistic to be a good base for deciding which
>>> 10 of the 20 candidates you want to move (and I've repeatedly pondered
>>> removing it completely).
>>>
>>
>> Thanks for the explanation!  Do you really want to remove it completely
>> rather than just improve it with a better one?  :-\
> 
> ;)  For example, the LIM cost model makes it not hoist an invariant (int)x,
> but then PRE, which detects invariant motion opportunities as partial
> redundancies, happily does (because PRE has no cost model at all - heh).
> 
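To illustrate the invariant (int)x case mentioned above, here is a minimal hand-written sketch in C.  It is not taken from GCC or from exchange2_r; the function names and the `short` parameter are assumptions chosen only to make the conversion visible.  The second function shows by hand what hoisting the conversion amounts to: one extra value kept live across the whole loop.

```c
#include <assert.h>
#include <stddef.h>

/* Before: the conversion (int)x does not depend on i, so it is
   loop-invariant, but as written it sits inside the loop body.  */
long sum_scaled_naive(const int *a, size_t n, short x)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (long)a[i] * (int)x;
    return sum;
}

/* After: the conversion is hoisted by hand.  xi is computed once
   and then stays live for the whole loop.  */
long sum_scaled_hoisted(const int *a, size_t n, short x)
{
    assert(a != NULL || n == 0);
    int xi = (int)x;                 /* hoisted invariant */
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (long)a[i] * xi;
    return sum;
}
```

Both functions compute the same result; the only difference is where the conversion lives.  A cheap conversion like this saves very little per iteration, which is presumably why a benefit-based model would decline to move it.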

Got it, thanks for further clarification. :)

>> There are some PRs (PR96825, PR98782) related to exchange2_r which
>> seem to suffer from high register pressure and bad spilling.  I'm not sure
>> whether they are also somehow related to the pressure coming from LIM, but
>> the trigger is commit
>> 1118a3ff9d3ad6a64bba25dc01e7703325e23d92, which adjusts prediction
>> frequency.  Maybe it's worth revisiting the idea of considering
>> BB frequency in the LIM cost model:
>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> 
> Note most "problems", and those which are harder to undo, stem from
> LIM's store motion, which increases register pressure inside loops by
> adding loop-carried dependences.  The BB frequency might be a way
> to order candidates when we have a way to set a better cap on the
> number of refs to move.  Note the current "cost" model is rather a
> benefit model and causes us to not move cheap things (like the above
> conversion) because it seems not worth the trouble.
> 
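As a rough illustration of how store motion adds a loop-carried dependence, here is a hand-written C sketch.  It is not GCC output; the function names are hypothetical, and `restrict` is an assumption that stands in for the alias analysis a compiler would need before making this transformation.

```c
#include <assert.h>
#include <stddef.h>

/* Before: *acc is loaded and stored on every iteration, so no
   extra value has to stay in a register across iterations.  */
void accumulate_mem(int *restrict acc, const int *restrict a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        *acc += a[i];
}

/* After: store motion done by hand.  The load moves in front of
   the loop and the store moves behind it.  tmp is now carried from
   one iteration to the next (a PHI node in SSA terms) and occupies
   a register for the whole loop.  */
void accumulate_reg(int *restrict acc, const int *restrict a, size_t n)
{
    assert(acc != NULL);
    int tmp = *acc;                  /* load hoisted before the loop */
    for (size_t i = 0; i < n; i++)
        tmp += a[i];                 /* loop-carried dependence on tmp */
    *acc = tmp;                      /* store sunk after the loop */
}
```

With one reference the extra register is harmless; with many such references in the same loop nest, each one adds another value that is live throughout the loop.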

Yeah, I noticed it at least excludes "cheap" ones.

> Note a very simple way would be to have a --param specifying a
> maximum number of refs to move (but note there are several
> LIM/store-motion passes so any such static limit would have
> surprising effects).  For store-motion I considered a hard limit on
> the number of loop-carried dependences (PHIs) and counting both
> existing and added ones (to avoid the surprise).
> 
> Note how such limits or other cost models should consider inner and
> outer loop behavior remains to be determined - at least LIM works
> at the level of whole loop nests and there's a rough idea of dependent
> transforms but simply gathering candidates and stripping some isn't
> going to work without major surgery in that area I think.
> 
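The two ideas above (a hard limit on loop-carried dependences that counts existing and added PHIs, and BB frequency as a way to order candidates) could be combined roughly as in the following sketch.  Everything here is hypothetical: `struct sm_candidate`, `select_for_store_motion` and the `max_phis` limit are invented for illustration and do not correspond to any existing GCC data structure, function or --param.  It also ignores the inner/outer loop question raised above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical store-motion candidate: an id plus the execution
   frequency of the basic block containing the reference.  */
struct sm_candidate {
    int id;
    unsigned bb_freq;
};

/* Order by frequency, highest first; break ties by id so the
   result is deterministic.  */
static int by_freq_desc(const void *pa, const void *pb)
{
    const struct sm_candidate *a = pa, *b = pb;
    if (a->bb_freq != b->bb_freq)
        return a->bb_freq < b->bb_freq ? 1 : -1;
    return (a->id > b->id) - (a->id < b->id);
}

/* Sort cands in place and return how many of the leading entries
   may be moved without the loop exceeding max_phis loop-carried
   dependences, counting the PHIs that already exist.  */
size_t select_for_store_motion(struct sm_candidate *cands, size_t n,
                               unsigned existing_phis, unsigned max_phis)
{
    assert(cands != NULL || n == 0);
    if (existing_phis >= max_phis)
        return 0;                    /* budget already used up */
    size_t budget = max_phis - existing_phis;
    if (n > 0)
        qsort(cands, n, sizeof *cands, by_freq_desc);
    return n < budget ? n : budget;
}
```

For example, with four candidates, two existing PHIs and a limit of four, only the two candidates from the hottest blocks would be selected.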

Thanks for all the notes and thoughts.  I had better look at RA remat
first.  Xionghu has some interest in investigating how to consider BB freq in
LIM; I will check its effect and then look further into these ideas if needed.

BR,
Kewen

>>> As to putting the burden on RA - yes, that's one possibility.  The other
>>> possibility is to use the register-pressure aware scheduler, though not
>>> sure if that will ever move things into loop bodies.
>>>
>>
>> Brand new idea!  IIUC it requires a global scheduler; I'm not sure how well
>> GCC's global scheduler performs.  Generally speaking, the register-pressure
>> aware scheduler will prefer the insn which has more register deaths (for the
>> regclass under pressure).  For this problem the modeling seems a bit different:
>> it has to care about the total interference numbers between two "equivalent"
>> blocks (src/dest).  I'm not sure if it's easier to do than rematerialization.
> 
> No idea either but as said above undoing store-motion is harder than
> scheduling or RA remat.
> 
>>>> btw, the example loop is at line 1150 from src exchange2.fppized.f90

>