> On Thu, 11 Apr 2019, Jan Hubicka wrote:
>
> > Hi,
> > the LTO streaming forks for every partition.  With the number of
> > partitions increased to 128 and relatively large memory usage (around
> > 5GB) needed to WPA Firefox, this causes the kernel to spend a lot of
> > time, probably copying the page tables.
> >
> > This patch makes the streamer fork only lto_parallelism times and
> > stream num_partitions/lto_parallelism partitions in each worker.
> > I have also added a parameter because currently -flto=jobserver leads
> > to unlimited parallelism.  This should be fixed by connecting to Make's
> > jobserver and building our own mini jobserver to distribute partitions
> > between worker threads, but that seems a bit too involved for a
> > last-minute change in stage4.  I plan to work on this and hopefully
> > backport it to the .2 release.
> >
> > I have tested the performance on my 32-CPU/64-thread box and got the
> > best wall time with 32, and therefore I set that as the default.  I get:
> >
> > --param max-lto-streaming-parallelism=1
> > Time variable                usr           sys          wall          GGC
> > phase stream out :  50.65 ( 30%)  20.66 ( 61%)  71.38 ( 35%)     921 kB (  0%)
> > TOTAL            : 170.73         33.69        204.64        7459610 kB
> >
> > --param max-lto-streaming-parallelism=4
> > phase stream out :  13.79 ( 11%)   6.80 ( 35%)  20.94 ( 14%)     155 kB (  0%)
> > TOTAL            : 130.26         19.68        150.46        7458844 kB
> >
> > --param max-lto-streaming-parallelism=8
> > phase stream out :   8.94 (  7%)   5.21 ( 29%)  14.15 ( 10%)      83 kB (  0%)
> > TOTAL            : 125.28         18.09        143.54        7458773 kB
> >
> > --param max-lto-streaming-parallelism=16
> > phase stream out :   4.56 (  4%)   4.34 ( 25%)   9.46 (  7%)      35 kB (  0%)
> > TOTAL            : 122.60         17.21        140.56        7458725 kB
> >
> > --param max-lto-streaming-parallelism=32
> > phase stream out :   2.34 (  2%)   5.69 ( 31%)   8.03 (  6%)      15 kB (  0%)
> > TOTAL            : 118.53         18.36        137.08        7458705 kB
> >
> > --param max-lto-streaming-parallelism=64
> > phase stream out :   1.63 (  1%)  15.76 ( 55%)  17.40 ( 12%)      13 kB (  0%)
> > TOTAL            : 122.17         28.66        151.00        7458702 kB
> >
> > --param max-lto-streaming-parallelism=256
> > phase stream out :   1.28 (  1%)   9.24 ( 41%)  10.53 (  8%)      13 kB (  0%)
> > TOTAL            : 116.78         22.56        139.53        7458702 kB
> >
> > Note that it is a bit odd that 64 leads to worse results than full
> > parallelism, but that seems to reproduce relatively well.  Also, the
> > usr/sys times for streaming are not representative since they do not
> > account for the sys time of the forked workers.  I am not sure where
> > the fork time is accounted.
> >
> > Generally it seems that the forking performance is not at all that bad
> > and scales reasonably, but I still think we should limit the default to
> > something less than the 128 we use now.  There are definitely
> > diminishing returns after increasing past 16 or 32, and memory use goes
> > up noticeably.  With current trunk, memory use also does not seem
> > terribly bad (less global-stream streaming makes the workers cheaper),
> > and in all memory traces I collected it is dominated by the compilation
> > stage during a full rebuild.
> >
> > I did similar tests for the cc1 binary.  There the relative time spent
> > in streaming is lower, so it goes from 17% to 1% (for parallelism 1 and
> > 32 respectively).
> >
> > Bootstrapped/regtested x86_64-linux, OK?
>
> Please document the new param in invoke.texi.  Otherwise looks good
> to me.  Btw, do we actually allocate garbage at write-out time?
> Thus, would using threads work as well?
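
To make the partitioning scheme above concrete, here is a minimal standalone
sketch of the idea (not the actual lto/lto.c code; stream_out_all,
stream_out_partition and the counts in main are made-up placeholders): fork
only "parallelism" workers and let each stream a contiguous block of roughly
num_partitions/parallelism partitions, instead of forking once per partition.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for the real per-partition streaming work.  */
static void
stream_out_partition (int i)
{
  printf ("worker %d streams partition %d\n", (int) getpid (), i);
}

/* Fork only PARALLELISM workers; each streams a contiguous block of
   roughly N_PARTITIONS / PARALLELISM partitions.  */
static void
stream_out_all (int n_partitions, int parallelism)
{
  int chunk = (n_partitions + parallelism - 1) / parallelism;

  for (int w = 0; w < parallelism; w++)
    {
      pid_t pid = fork ();
      if (pid == 0)
        {
          int first = w * chunk;
          int last = first + chunk < n_partitions
                     ? first + chunk : n_partitions;
          for (int i = first; i < last; i++)
            stream_out_partition (i);
          fflush (stdout);
          _exit (0);
        }
      else if (pid < 0)
        abort ();
    }

  /* Parent waits for every worker.  */
  for (int w = 0; w < parallelism; w++)
    wait (NULL);
}

int
main (void)
{
  stream_out_all (128, 32);	/* e.g. 128 partitions, 32 workers */
  return 0;
}

With 128 partitions and the default parallelism of 32, each worker handles 4
partitions, so the ~5GB WPA image's page tables are copied 32 times rather
than 128.
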
Getting the thread-based variant working is on my TODO list.  Last time I
checked by adding an abort into ggc_alloc there were some occurrences, but I
think those can be cleaned up.  I wonder how much of a performance hit we
would get from enabling pthreads for the lto1 binary and thus building
libbackend with it?

Honza
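
PS: for reference, that kind of check is roughly the following (a standalone
sketch, not actual GCC code; checked_alloc and in_stream_out are made-up
names — in lto1 the abort would go into ggc_alloc itself): set a flag around
the stream-out phase and abort on any allocation while it is set.

#include <stdlib.h>

static int in_stream_out;	/* set while writing partitions out */

static void *
checked_alloc (size_t size)
{
  if (in_stream_out)
    abort ();			/* garbage allocated at write-out time */
  return malloc (size);
}

int
main (void)
{
  void *p = checked_alloc (16);	/* fine: not streaming yet */
  free (p);

  in_stream_out = 1;
  /* Any checked_alloc call here would abort, pointing at code that
     still allocates GC memory during write-out.  */
  in_stream_out = 0;
  return 0;
}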