> On Thu, 11 Apr 2019, Jan Hubicka wrote:
> 
> > Hi,
> > the LTO streaming forks for every partition. With the number of
> > partitions increased to 128 and the relatively large memory usage
> > (around 5GB) needed to WPA Firefox, this causes the kernel to spend
> > a lot of time, probably copying the page tables.
> > 
> > This patch makes the streamer fork only lto_parallelism times and
> > stream num_partitions/lto_parallelism partitions in each worker.
> > I have also added a parameter because currently -flto=jobserver
> > leads to unlimited parallelism.  This should be fixed by connecting
> > to Make's jobserver and building our own mini jobserver to
> > distribute partitions between the workers, but that seems a bit too
> > involved for a last-minute change in stage4.  I plan to work on this
> > and hopefully backport it to the .2 release.
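> > 
> > In outline the scheme is the following (a simplified sketch, not the
> > patch itself; npartitions and stream_out_partition are placeholders
> > for the real partition count and partition streamer):
> > 
> >   /* Fork only nworkers children instead of one per partition;
> >      each child streams a contiguous slice of the partitions.
> >      Needs <unistd.h>, <stdlib.h> and <sys/wait.h>.  */
> >   int nworkers = (lto_parallelism < npartitions
> >                   ? lto_parallelism : npartitions);
> >   int chunk = (npartitions + nworkers - 1) / nworkers;
> >   for (int w = 0; w < nworkers; w++)
> >     if (fork () == 0)
> >       {
> >         int end = (w + 1) * chunk;
> >         if (end > npartitions)
> >           end = npartitions;
> >         for (int i = w * chunk; i < end; i++)
> >           stream_out_partition (i);
> >         exit (0);
> >       }
> >   for (int w = 0; w < nworkers; w++)
> >     wait (NULL);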
> > 
> > I have tested the performance on my 32-core, 64-thread box and got
> > the best wall time with parallelism of 32, which I therefore set as
> > the default.  I get
> > 
> > --param max-lto-streaming-parallelism=1
> > Time variable                                   usr           sys          wall               GGC
> >  phase stream out                   :  50.65 ( 30%)  20.66 ( 61%)  71.38 ( 35%)     921 kB (  0%)
> >  TOTAL                              : 170.73         33.69        204.64        7459610 kB
> > 
> > --param max-lto-streaming-parallelism=4
> >  phase stream out                   :  13.79 ( 11%)   6.80 ( 35%)  20.94 ( 14%)     155 kB (  0%)
> >  TOTAL                              : 130.26         19.68        150.46        7458844 kB
> > 
> > --param max-lto-streaming-parallelism=8
> >  phase stream out                   :   8.94 (  7%)   5.21 ( 29%)  14.15 ( 10%)      83 kB (  0%)
> >  TOTAL                              : 125.28         18.09        143.54        7458773 kB
> > 
> > --param max-lto-streaming-parallelism=16
> >  phase stream out                   :   4.56 (  4%)   4.34 ( 25%)   9.46 (  7%)      35 kB (  0%)
> >  TOTAL                              : 122.60         17.21        140.56        7458725 kB
> > 
> > --param max-lto-streaming-parallelism=32
> >  phase stream out                   :   2.34 (  2%)   5.69 ( 31%)   8.03 (  6%)      15 kB (  0%)
> >  TOTAL                              : 118.53         18.36        137.08        7458705 kB
> > 
> > --param max-lto-streaming-parallelism=64
> >  phase stream out                   :   1.63 (  1%)  15.76 ( 55%)  17.40 ( 12%)      13 kB (  0%)
> >  TOTAL                              : 122.17         28.66        151.00        7458702 kB
> > 
> > --param max-lto-streaming-parallelism=256
> >  phase stream out                   :   1.28 (  1%)   9.24 ( 41%)  10.53 (  8%)      13 kB (  0%)
> >  TOTAL                              : 116.78         22.56        139.53        7458702 kB
> > 
> > Note that it is a bit odd that 64 leads to worse results than full
> > parallelism, but this seems to reproduce relatively well. Also, the
> > usr/sys times for streaming are not representative since they do not
> > account for the sys time of the forked workers. I am not sure where
> > the fork time is accounted.
> > 
> > Generally it seems that the forking performance is not all that bad
> > and scales reasonably, but I still think we should limit the default
> > to something less than the 128 we use now. There are definitely
> > diminishing returns after going past 16 or 32, and memory use goes
> > up noticeably. With current trunk, memory use also does not seem
> > terribly bad (streaming less of the global stream makes the workers
> > cheaper), and in all the memory traces I collected it is dominated
> > by the compilation stage during a full rebuild.
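> > 
> > (For reference, the default can be overridden at link time, e.g.
> > gcc -flto=32 --param max-lto-streaming-parallelism=16; this command
> > line is only illustrative and not part of the patch.)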
> > 
> > I did similar tests for the cc1 binary. There the relative time
> > spent in streaming is lower, so it goes from 17% down to 1% (for
> > parallelism 1 and 32 respectively).
> > 
> > Bootstrapped/regtested x86_64-linux, OK?
> 
> Please document the new param in invoke.texi.  Otherwise looks good
> to me.  Btw, do we actually allocate garbage at write-out time?
> Thus, would using threads work as well?

It is on my TODO list to get this working.  Last time I checked, by
adding an abort into ggc_alloc, there were some occurrences, but I
think those can be cleaned up.
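
The check was essentially of this shape (illustrated here with a
stand-in allocator rather than the actual GCC code; streaming_out_p is
a made-up flag):

  #include <stdbool.h>
  #include <stdlib.h>

  static bool streaming_out_p;   /* set around the stream-out phase */

  /* Stand-in for the GC allocator entry point; the real check went
     into ggc_alloc.  */
  static void *
  my_alloc (size_t size)
  {
    if (streaming_out_p)
      abort ();   /* any hit means write-out still allocates GC memory */
    return malloc (size);
  }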

I wonder how much of a performance hit we would get from enabling
pthreads for the lto1 binary and thus building libbackend with it?
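
If the stray GC allocations can indeed be removed, the fork loop could
in principle become a thread pool along these lines (purely
illustrative; stream_slice stands in for streaming one worker's share
of the partitions):

  #include <pthread.h>
  #include <stdint.h>

  #define NWORKERS 32                 /* i.e. max-lto-streaming-parallelism */

  extern void stream_slice (int w);   /* placeholder */

  static void *
  stream_worker (void *arg)
  {
    stream_slice ((int) (intptr_t) arg);
    return NULL;
  }

  static void
  stream_out_all (void)
  {
    pthread_t workers[NWORKERS];
    for (int w = 0; w < NWORKERS; w++)
      pthread_create (&workers[w], NULL, stream_worker,
                      (void *) (intptr_t) w);
    for (int w = 0; w < NWORKERS; w++)
      pthread_join (workers[w], NULL);
  }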

Honza
