Re: If you had a month to improve gcc build parallelization, where would you begin?
On 10/04/13 04:51, Geert Bosch wrote:
> On Apr 9, 2013, at 22:19, Segher Boessenkool wrote:
>> Some numbers, 16-core 64-thread POWER7, c,c++,fortran bootstrap:
>> -j6:  real 57m32.245s
>> -j60: real 38m18.583s
>
> Yes, these confirm mine. It doesn't make sense to look at more
> parallelization before we address the serial bottlenecks.
> The -j6 parallelism level is about where current laptops
> are. Having a big machine doesn't do as much as having fewer,
> but faster cores.
>
> We should be able to do far better. I don't know how the POWER7
> threads compare in terms of CPU throughput, but going from -j6 to
> -j48 on our 48-core AMD system should easily yield a 6x speed-up,
> as all are full cores. Yet we get similarly limited improvements to
> yours, while we get almost perfect scaling in many test suite runs
> that are dominated by compilations.
>
> The two obvious issues:
> 1. large sequential chains of compiling/running genattrtab followed
>    by compiling insn-attrtab.c and linking the compiler
> 2. repeated serial configure steps
>
> For 1. we need to somehow split the file up into smaller chunks.
> For 2. we need to have efficient caching.
>
> Neither is easy...
>
> -Geert

Efficient caching of configuration steps would have many more benefits than just improving parallel builds. It would directly improve builds on /all/ systems, and have an even bigger effect on re-builds (re-build compilation speed can be improved with ccache - but you still need to wait for the ./configure steps).

And of course, if someone figures out how to speed up ./configure with gcc, the same technique could probably be used on thousands of other programs.

I presume fixing the slow, inefficient serial nature of autotools is a hard job - otherwise it would have been done already. But any improvements there could have big potential benefits.
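For concreteness, autoconf's existing cache mechanism (a cache file passed via --cache-file, or a site file read through CONFIG_SITE) is the natural starting point for such caching. A minimal sketch of how the cache works, with illustrative paths and cache variables - this is not gcc's actual configury:

```shell
# Sketch: share one autoconf cache across repeated configure runs.
# The cache path and the sample ac_cv_* variables are illustrative.
CACHE=$(mktemp)

# A real invocation would look something like:
#   ../gcc/configure --cache-file="$CACHE" ...
# Later configure runs then read previously computed ac_cv_* results
# instead of re-running each compile/link test.

# The cache file is nothing more than shell assignments of ac_cv_*
# variables, using ${var=default} so existing values win:
echo 'ac_cv_header_stdio_h=${ac_cv_header_stdio_h=yes}' >> "$CACHE"
echo 'ac_cv_func_mmap_fixed_mapped=${ac_cv_func_mmap_fixed_mapped=yes}' >> "$CACHE"

# configure sources the cache the same way:
. "$CACHE"
echo "$ac_cv_header_stdio_h"
```

The point is that the cache is just a flat file of shell variables, so sharing or pre-seeding it across build trees is mechanically trivial; the hard part is deciding when cached answers are still valid.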
Re: If you had a month to improve gcc build parallelization, where would you begin?
On Wed, Apr 10, 2013 at 4:51 AM, Geert Bosch wrote:
> On Apr 9, 2013, at 22:19, Segher Boessenkool wrote:
>> Some numbers, 16-core 64-thread POWER7, c,c++,fortran bootstrap:
>> -j6:  real 57m32.245s
>> -j60: real 38m18.583s
>
> Yes, these confirm mine. It doesn't make sense to look at more
> parallelization before we address the serial bottlenecks.
> The -j6 parallelism level is about where current laptops
> are. Having a big machine doesn't do as much as having fewer,
> but faster cores.

Indeed, having 4-6 cores, with the possibility of serial stages letting the CPU boost the frequency of the only active core as much as possible, leads to the fastest _bootstrap_ times. Now, testing is a different story - and testing dominates my bootstrap / regtest throughput. So a combined bootstrap & test make target that would interleave stage3 target library building, target library testing and compiler testing (harder due to some target library dependencies of certain .exp files) would already help (thinking of how libjava building is the bottleneck for a --enable-languages=all build). Similarly, allowing multilibs to be built in parallel would help.

Richard.
Question about Os, O2 and duplicating code
Hi,

I have this problem in a private backend, but it is reproducible on x86 gcc as well, so I suppose it is a core GCC problem. Let's consider a simple example:

unsigned int buffer[10];

__attribute__((noinline)) void
myFunc(unsigned int a, unsigned int b, unsigned int c)
{
  unsigned int tmp;
  if( a & 0x2 )
    {
      tmp = 0x3221;
      tmp |= (a & 0xF) << 24;
      tmp |= (a & 0x3) << 2;
    }
  else
    {
      tmp = 0x83621;
      tmp |= (a & 0xF) << 24;
      tmp |= (a & 0x3) << 2;
    }
  buffer[0] = tmp;
}

Compiling it with -Os to assembler yields:

        movl    %edi, %eax
        andl    $15, %eax
        sall    $24, %eax
        testb   $2, %dil
        je      .L2
        andl    $3, %edi
        sall    $2, %edi
        orl     %edi, %eax
        orl     $12833, %eax
        jmp     .L3
.L2:
        andl    $3, %edi
        sall    $2, %edi
        orl     %edi, %eax
        orl     $538145, %eax
.L3:
        movl    %eax, buffer(%rip)
        ret

There is a big common code fragment:

        andl    $3, %edi
        sall    $2, %edi
        orl     %edi, %eax

that could potentially be hoisted above the branch to reduce code size. But it isn't, not even at -O2.

Things are even worse in my backend: my target has a conditional or, so all the code in this case could be linearized, with a good performance impact. The only reason it isn't is that GCC fails to hoist the common part out of the if-else before trying conditional execution.

Can someone advise me how to tune the target machine options to signal that moving common parts out of conditional expressions is profitable? Or is the only way to write a custom pass?

---
With best regards, Konstantin
Re: Question about Os, O2 and duplicating code
On Wed, Apr 10, 2013 at 11:46 AM, Konstantin Vladimirov wrote:
> Hi,
>
> I have this problem in a private backend, but it is reproducible on
> x86 gcc as well, so I suppose it is a core GCC problem.
>
> [... example snipped ...]
>
> There is a big common code fragment:
>
>         andl    $3, %edi
>         sall    $2, %edi
>         orl     %edi, %eax
>
> that could potentially be hoisted above the branch to reduce code
> size. But it isn't, not even at -O2.
>
> Things are even worse in my backend: my target has a conditional or,
> so all the code in this case could be linearized, with a good
> performance impact. The only reason it isn't is that GCC fails to
> hoist the common part out of the if-else before trying conditional
> execution.
>
> Can someone advise me how to tune the target machine options to
> signal that moving common parts out of conditional expressions is
> profitable? Or is the only way to write a custom pass?

Apart from RTL cross-jumping, which is quite limited, there is no existing pass that does this transformation. There is PR23286 for this, which also has some patches.

Richard.
Re: If you had a month to improve gcc build parallelization, where would you begin?
Quoting David Brown:

> I presume fixing the slow, inefficient serial nature of autotools is
> a hard job - otherwise it would have been done already. But any
> improvements there could have big potential benefits.

I think it's more likely because it is tedious and time-consuming to do, and is not really high enough on any one person's priority list.

The likelihood that a test depends on the outcome of the last n tests is rather low. So you could run tests speculatively with an incomplete set of defines, and then re-run them once you have gathered all the results from the preceding tests, to verify that you computed the right result. Backtrack if necessary.

Also, some parts of the autotools could probably be sped up by using compiled C programs instead of shell scripts and m4 - particularly if we add the test result sync / merge tasks outlined above. Building these compiled autotools might in turn need some old-style autoconf, and only be suitable for a subset of systems, but with the heavy usage of autotools that we have today, the build cost would be quickly amortized on a typical developer's machine. Reminds me a bit of the evolution of the Usenet news software C News.
Re: If you had a month to improve gcc build parallelization, where would you begin?
On Tue, Apr 9, 2013 at 9:50 PM, David Brown wrote:
> And of course if someone figures out how to speed up ./configure with
> gcc, the same technique could probably be used on thousands of other
> programs.

Is gcc already using a new enough autoconf (2.64+) to take advantage of shell functions? That has been a big speedup in general with configure. I know it's more of an evolutionary rather than revolutionary change, but every bit helps.
Re: If you had a month to improve gcc build parallelization, where would you begin?
On 4/10/2013 2:52 AM, Richard Biener wrote:
> On Wed, Apr 10, 2013 at 4:51 AM, Geert Bosch wrote:
>> On Apr 9, 2013, at 22:19, Segher Boessenkool wrote:
>>> Some numbers, 16-core 64-thread POWER7, c,c++,fortran bootstrap:
>>> -j6:  real 57m32.245s
>>> -j60: real 38m18.583s
>>
>> Yes, these confirm mine. It doesn't make sense to look at more
>> parallelization before we address the serial bottlenecks.
>> The -j6 parallelism level is about where current laptops
>> are. Having a big machine doesn't do as much as having fewer,
>> but faster cores.
>
> Indeed, having 4-6 cores, with the possibility of serial stages
> letting the CPU boost the frequency of the only active core as much
> as possible, leads to the fastest _bootstrap_ times. Now, testing is
> a different story - and testing dominates my bootstrap / regtest
> throughput.

I have side-stepped the entire -jN issue by using GNU parallel and running build/test of multiple targets in parallel, with each build/test at -j1. This doesn't help the per-target build time, but since we build around 15 targets, it is an improvement overall.

My rule of thumb is that for builds, 1.5-2 times the number of cores is usually enough to keep a computer occupied but not overwhelmed. But if you are including running tests on simulators, the cores are heavily utilized during the test runs, so no more than 1.5 times the number of cores.

> So a combined bootstrap & test make target that would interleave
> stage3 target library building, target library testing and compiler
> testing (harder due to some target library dependencies of certain
> .exp files) would already help (thinking of how libjava building is
> the bottleneck for a --enable-languages=all build). Similarly,
> allowing multilibs to be built in parallel would help.

This is a big one for RTEMS targets. powerpc-rtems takes ~1hr at -j1 to build binutils, gcc/newlib, and gdb on a 2.9 GHz i7 without testing. Most of our targets are around 25 min. As the number of multilibs goes up, there are large variations.

> Richard.

--
Joel Sherrill, Ph.D.
Director of Research & Development        joel.sherr...@oarcorp.com
On-Line Applications Research             Huntsville AL 35805
Ask me about RTEMS: a free RTOS           Support Available
                                          (256) 722-9985
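The multiple-targets-at--j1 scheme above is easy to sketch. With GNU parallel installed it would be roughly `parallel ./build-one.sh ::: arm-rtems powerpc-rtems ...`, where build-one.sh configures and makes one target serially; the target names and the build script are illustrative, not RTEMS's actual tooling. A portable stand-in using only shell job control:

```shell
# One serial build per target, all targets running concurrently.
# 'echo' stands in for the real per-target configure && make -j1
# sequence (e.g. a hypothetical ./build-one.sh "$t").
targets="arm-rtems powerpc-rtems sparc-rtems"

for t in $targets; do
    ( echo "building $t" ) &
done
wait                        # block until every background build exits
echo "all targets done"
```

This trades per-target latency for overall throughput: each target still takes its full serial time, but the wall-clock time for the whole set approaches that of the slowest single target, assuming enough cores and I/O to go around.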
Google Summer of Code Webcast
Hi,

For those interested in participating in the Google Summer of Code, the Google Open Source Program Office is holding a video conference using Google Hangouts tomorrow, which should be archived on YouTube afterwards.

https://plus.google.com/events/cpooa4srhdkp7o6ttsu88tcuhl4

This is a good opportunity to learn more about the program in general.

--
Joel Sherrill, Ph.D.                      Director of Research & Development
joel.sherr...@oarcorp.com                 On-Line Applications Research
Ask me about RTEMS: a free RTOS           Huntsville AL 35805
Support Available                         (256) 722-9985
Re: If you had a month to improve gcc build parallelization, where would you begin?
>>>>> "Joern" == Joern Rennecke writes:

Joern> The likelihood that a test depends on the outcome of the last
Joern> n tests is rather low. So you could run tests speculatively with
Joern> an incomplete set of defines, and then re-run them when you have
Joern> gathered all the results from the preceding tests to verify that
Joern> you have computed the right result.

I think there are things that can be parallelized without needing to do any speculation. For example, AC_CHECK_HEADERS is often invoked with many header files; these could in most cases be checked for in parallel. Similarly for AC_CHECK_FUNCS.

Beyond that, you could define a general way to run checks in parallel and then change the gcc configure script to use it, using knowledge of the code to decide what dependencies there are. Whether or not this would yield a big enough benefit, though...

I had the vague impression that this was being looked at in upstream autoconf, but I don't actually follow it any more.

Tom
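The AC_CHECK_HEADERS idea can be illustrated in a few lines of shell. This is not what autoconf actually emits - just a sketch of probing several headers concurrently and collecting the results afterwards; the compiler variable and header list are placeholders:

```shell
# Probe several headers concurrently, autoconf-style, then gather the
# per-header results once all the background jobs have finished.
CC=${CC:-cc}
dir=$(mktemp -d)

for h in stdio.h stdlib.h string.h; do
    (
        # One tiny conftest program per header, compiled independently,
        # so the probes can run in parallel.
        printf '#include <%s>\nint main(void){return 0;}\n' "$h" \
            > "$dir/conftest-$h.c"
        if $CC -o "$dir/conftest-$h" "$dir/conftest-$h.c" 2>/dev/null; then
            echo "$h yes"
        else
            echo "$h no"
        fi
    ) > "$dir/result-$h" &
done
wait

# Collect results in a deterministic order, independent of job timing.
results=$(cat "$dir"/result-*)
echo "$results"
rm -rf "$dir"
```

The sync step at the end is the part Joern's message calls the "test result sync / merge" task: the probes themselves are independent, so only the final collection of ac_cv_*-style answers needs to be serial.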