Re: Where did my function go?

2020-10-21 Thread Jan Hubicka
> On Wed, Oct 21, 2020 at 5:21 AM Gary Oblock  wrote:
> >
> > >IPA transforms happen when get_body is called.  With LTO this also
> > >triggers reading the body from disk.  So if you want to see all bodies
> > >and work on them, you can simply call get_body on everything but it will
> > >result in increased memory use since everything will be loaded from disk
> > >and expanded (by inlining) at once instead of doing it on a per-function
> > >basis.
> > Jan,
> >
> > Doing
> >
> > FOR_EACH_FUNCTION_WITH_GIMPLE_BODY ( node) node->get_body ();
> >
> > instead of
> >
> > FOR_EACH_FUNCTION_WITH_GIMPLE_BODY ( node) node->get_untransformed_body ();
> >
> > instantaneously breaks everything...
> 
> I think during WPA you cannot do ->get_body (), only
> ->get_untransformed_body ().  But
> we don't know yet where in the IPA process you're experiencing the issue.

Originally get_body was designed to work in WPA as well: the info about
what transforms are to be applied is kept in a vector with per-function
granularity. But there may be some issues as this path is untested and
e.g. ipa-sra/ipa-prop do quite difficult transformations these days.
What happens?

Honza
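
The difference between the two accessors discussed above can be sketched with a toy model. This is not GCC code: ToyNode, its fields, and redirect_calls are hypothetical names invented for illustration, and only the method names echo the thread. It assumes each function keeps its own list of pending transforms, which get_body applies and get_untransformed_body leaves queued.

```python
# Toy model only: NOT GCC code. ToyNode and redirect_calls are hypothetical.
class ToyNode:
    def __init__(self, name, on_disk_body, pending_transforms):
        self.name = name
        self._on_disk_body = on_disk_body        # stands in for the LTO stream
        self.pending = list(pending_transforms)  # per-function transform list
        self.body = None                         # body not loaded yet

    def get_untransformed_body(self):
        """Load the body if needed, but leave pending transforms queued."""
        if self.body is None:
            self.body = list(self._on_disk_body)
        return self.body

    def get_body(self):
        """Load the body, then apply and clear every pending transform."""
        self.get_untransformed_body()
        while self.pending:
            transform = self.pending.pop(0)
            self.body = transform(self.body)
        return self.body


def redirect_calls(old, new):
    """Return a transform that rewrites calls to `old` into calls to `new`."""
    def transform(body):
        return [stmt.replace(f"call {old}", f"call {new}") for stmt in body]
    return transform


node = ToyNode("main", ["x = 1", "call foo_dup", "return x"],
               [redirect_calls("foo_dup", "foo")])
untransformed = list(node.get_untransformed_body())  # still calls foo_dup
transformed = list(node.get_body())                  # call redirected to foo
print(untransformed)
print(transformed)
```

In this sketch the untransformed view still names the removed duplicate, which mirrors the situation in the thread where the IL is unmodified until the transform step runs. Calling get_body on every node would load and rewrite every body at once, which is the memory cost mentioned above.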
> 
> Richard.
> 
> > Am I missing something?
> >
> > Gary
> > 
> > From: Jan Hubicka 
> > Sent: Tuesday, October 20, 2020 4:34 AM
> > To: Richard Biener 
> > Cc: GCC Development ; Gary Oblock 
> > 
> > Subject: Re: Where did my function go?
> >
> >
> > > > On Tue, Oct 20, 2020 at 1:02 PM Martin Jambor  wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Tue, Oct 20 2020, Richard Biener wrote:
> > > > > > On Mon, Oct 19, 2020 at 7:52 PM Gary Oblock 
> > > > > >  wrote:
> > > > > >>
> > > > > >> Richard,
> > > > > >>
> > > > > >> I guess that will work for me. However, since it
> > > > > >> was decided to remove an identical function,
> > > > > >> why weren't the calls to it adjusted to reflect it?
> > > > > >> If the call wasn't transformed that means it will
> > > > > >> be mapped at some later time. Is that mapping
> > > > > >> available to look at? Because using that would
> > > > > >> also be a potential solution (assuming call
> > > > > >> graph information exists for the deleted function.)
> > > > > >
> > > > > > > I'm not sure what the transitional cgraph looks like
> > > > > > during WPA analysis (which is what we're talking about?),
> > > > > > but definitely the IL is unmodified in that state.
> > > > > >
> > > > > > Maybe Martin has an idea.
> > > > > >
> > > > >
> > > > > > Exactly, the cgraph_edges are where the correct call information is
> > > > > > stored until the inlining transformation phase calls
> > > > > > cgraph_edge::redirect_call_stmt_to_callee on them - inlining is
> > > > > > a special pass in this regard that performs this IPA-infrastructure
> > > > > > function in addition to actual inlining.
> > > > >
> > > > > > Here the call information means the callee itself but also information in
> > > > > > e->callee->clone.param_adjustments which might be interesting for any
> > > > > > struct-reorg-like optimizations (...and in future possibly in other
> > > > > > transformation summaries).
> > > > >
> > > > > > The late IPA passes are in a very unfortunate spot here since they run
> > > > > > before the real-IPA transformation phases but after unreachable node
> > > > > > removals and after clone materializations and so can see some but not
> > > > > > all of the changes performed by real IPA passes.  The reason for that is
> > > > > > good cache locality when late IPA passes are either not run at all or
> > > > > > only look at a small portion of the compilation unit.  In such a case IPA
> > > > > > transformations of a function are followed by all the late passes
> > > > > > working on the same function.
> > > > >
> > > > > Late IPA passes are unfortunately second class citizens and I would
> > > > > strongly recommend not to use them since they do not fit into our
> > > > > otherwise robust IPA framework very well.  We could probably provide a
> > > > > mechanism that would allow late IPA passes to run all normal IPA
> > > > > transformations on a function so they could clearly see what they are
> > > > > looking at, but extensive use would slow compilation down so its use
> > > > > would be frowned upon at the very least.
> > > >
> > > > So IPA PTA does get_body () on the nodes it wants to analyze and I
> > > > thought that triggers any pending IPA transforms?
> > >
> > > > Yes, it does (and get_untransformed_body does not)
> > > And to correct Martin's explanation a bit: the late IPA passes are
> > > intended to work, though I was mostly planning them for prototyping true
> > > ipa passes and also possibly for implementing passes that inspect only
> > > a few functions.
> >
> > > IPA transforms happen when get_body is called.  With LTO this also
> > > triggers reading the b

Re: LTO slows down calculix by more than 10% on aarch64

2020-10-21 Thread Prathamesh Kulkarni via Gcc
On Thu, 24 Sep 2020 at 16:44, Richard Biener  wrote:
>
> On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni
>  wrote:
> >
> > On Wed, 23 Sep 2020 at 16:40, Richard Biener  
> > wrote:
> > >
> > > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
> > >  wrote:
> > > >
> > > > On Wed, 23 Sep 2020 at 13:22, Richard Biener 
> > > >  wrote:
> > > > >
> > > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > > > >  wrote:
> > > > > >
> > > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I obtained perf stat results for following 
> > > > > > > > > > > > > > benchmark runs:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 7856832.692380  task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > > > > > >           3758  context-switches    #    0.000 K/sec
> > > > > > > > > > > > > > >             40  cpu-migrations      #    0.000 K/sec
> > > > > > > > > > > > > > >          40847  page-faults         #    0.005 K/sec
> > > > > > > > > > > > > > >  7856782413676  cycles              #    1.000 GHz
> > > > > > > > > > > > > > >  6034510093417  instructions        #    0.77  insn per cycle
> > > > > > > > > > > > > > >   363937274287  branches            #   46.321 M/sec
> > > > > > > > > > > > > > >    48557110132  branch-misses       #   13.34% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a 
> > > > > > > > > > > > > profile over a minute should be
> > > > > > > > > > > > > enough for this kind of code)
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 8319643.114380  task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > > > > > >           4285  context-switches    #    0.001 K/sec
> > > > > > > > > > > > > > >             28  cpu-migrations      #    0.000 K/sec
> > > > > > > > > > > > > > >          40843  page-faults         #    0.005 K/sec
> > > > > > > > > > > > > > >  8319591038295  cycles              #    1.000 GHz
> > > > > > > > > > > > > > >  6276338800377  instructions        #    0.75  insn per cycle
> > > > > > > > > > > > > > >   467400726106  branches            #   56.180 M/sec
> > > > > > > > > > > > > > >    45986364011  branch-misses       #    9.84% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and 
> > > > > > > > > > > > > +480e9 cycles, probably implying
> > > > > > > > > > > > > that extra instructions are appearing in this loop 
> > > > > > > > > > > > > nest, but not in the innermost
> > > > > > > > > > > > > loop. As a reminder for others, the innermost loop 
> > > > > > > > > > > > > has only 3 iterations.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this 
> > > > > > > > > > > > > > removes the extra branches):
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 8207331.088040  task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > > > > > >           2266  context-switches    #    0.000 K/sec
> > > > > > > > > > > > > > >             32  cpu-migrations      #    0.000 K/sec
> > > > > > > > > > > > > > >          40846  page-faults
> > > > > > > > > > > > > 
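
The deltas cited in the reply above can be rechecked from the two complete perf stat runs quoted in this message (-O2, and -O2 with orthonl inlined). The following is a minimal sketch: the counter values are copied from the thread, while the dictionary layout and variable names are only illustrative.

```python
# Counter values copied from the two complete perf stat runs quoted above.
baseline = {  # -O2
    "cycles": 7856782413676,
    "instructions": 6034510093417,
    "branches": 363937274287,
    "branch-misses": 48557110132,
}
inlined = {  # -O2 with orthonl inlined
    "cycles": 8319591038295,
    "instructions": 6276338800377,
    "branches": 467400726106,
    "branch-misses": 45986364011,
}

# Absolute change per counter between the two runs.
deltas = {name: inlined[name] - baseline[name] for name in baseline}

for name, delta in deltas.items():
    percent = 100.0 * delta / baseline[name]
    print(f"{name:>14}: {delta / 1e9:+8.1f}e9 ({percent:+.1f}%)")

# Instructions per cycle, as perf reports in the "insn per cycle" column.
ipc_baseline = baseline["instructions"] / baseline["cycles"]
ipc_inlined = inlined["instructions"] / inlined["cycles"]
print(f"IPC: {ipc_baseline:.2f} -> {ipc_inlined:.2f}")
```

From these counters the differences come to roughly +103e9 branches, +242e9 instructions and +463e9 cycles, so the rounded figures in the reply are in the right range, with the cycle figure somewhat below the quoted +480e9.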

Re: LTO slows down calculix by more than 10% on aarch64

2020-10-21 Thread Richard Biener via Gcc
On Wed, Oct 21, 2020 at 12:04 PM Prathamesh Kulkarni
 wrote:
>
> On Thu, 24 Sep 2020 at 16:44, Richard Biener  
> wrote:
> >
> > On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni
> >  wrote:
> > >
> > > On Wed, 23 Sep 2020 at 16:40, Richard Biener  
> > > wrote:
> > > >
> > > > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
> > > >  wrote:
> > > > >
> > > > > On Wed, 23 Sep 2020 at 13:22, Richard Biener 
> > > > >  wrote:
> > > > > >
> > > > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov 
> > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I obtained perf stat results for following 
> > > > > > > > > > > > > > > benchmark runs:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -O2:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 7856832.692380  task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > > > > > > >           3758  context-switches    #    0.000 K/sec
> > > > > > > > > > > > > > > >             40  cpu-migrations      #    0.000 K/sec
> > > > > > > > > > > > > > > >          40847  page-faults         #    0.005 K/sec
> > > > > > > > > > > > > > > >  7856782413676  cycles              #    1.000 GHz
> > > > > > > > > > > > > > > >  6034510093417  instructions        #    0.77  insn per cycle
> > > > > > > > > > > > > > > >   363937274287  branches            #   46.321 M/sec
> > > > > > > > > > > > > > > >    48557110132  branch-misses       #   13.34% of all branches
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a 
> > > > > > > > > > > > > > profile over a minute should be
> > > > > > > > > > > > > > enough for this kind of code)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 8319643.114380  task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > > > > > > >           4285  context-switches    #    0.001 K/sec
> > > > > > > > > > > > > > > >             28  cpu-migrations      #    0.000 K/sec
> > > > > > > > > > > > > > > >          40843  page-faults         #    0.005 K/sec
> > > > > > > > > > > > > > > >  8319591038295  cycles              #    1.000 GHz
> > > > > > > > > > > > > > > >  6276338800377  instructions        #    0.75  insn per cycle
> > > > > > > > > > > > > > > >   467400726106  branches            #   56.180 M/sec
> > > > > > > > > > > > > > > >    45986364011  branch-misses       #    9.84% of all branches
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and 
> > > > > > > > > > > > > > +480e9 cycles, probably implying
> > > > > > > > > > > > > > that extra instructions are appearing in this loop 
> > > > > > > > > > > > > > nest, but not in the innermost
> > > > > > > > > > > > > > loop. As a reminder for others, the innermost loop 
> > > > > > > > > > > > > > has only 3 iterations.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this 
> > > > > > > > > > > > > > > removes the extra branches):
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 8207331.088040  task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > > > > > > >           2266  context-switches
> > > > > > > > > > > > > >

The Next GCC/LLVM/RISC-V meetup in China: Hangzhou, Oct 24, 2020

2020-10-21 Thread 吴伟
Hi all,

The next OSDT (aka HelloLLVM/HelloGCC) meetup in China will happen on Oct
24, 2020.

The location is Hangzhou.

Everyone interested in GCC/LLVM Toolchain related projects and/or RISC-V is
invited to join.

Event details are at

Chinese Version:
https://github.com/hellogcc/osdt-weekly/blob/master/events/2020-10-24-hangzhou-meetup.md
English Version:
https://github.com/hellogcc/osdt-weekly/blob/master/events/2020-10-24-hangzhou-meetup.en.md

Presentations are welcome :-)

Current Topics:

- Wei Wu - Recent Progress in RISC-V International
- Ningning Shi - Intro of ART OptimizingCompiler
- Weiwei Li - Learning QEMU/RISU
- Free discussion

Looking forward to meeting you!


-- 
Best wishes,
Wei Wu (吴伟)