Re: If you had a month to improve gcc build parallelization, where would you begin?

2013-04-10 Thread David Brown
On 10/04/13 04:51, Geert Bosch wrote:
> 
> On Apr 9, 2013, at 22:19, Segher Boessenkool  
> wrote:
> 
>> Some numbers, 16-core 64-thread POWER7, c,c++,fortran bootstrap:
>> -j6:  real 57m32.245s
>> -j60: real 38m18.583s
> 
> Yes, these confirm mine. It doesn't make sense to look at more
> parallelization before we address the serial bottlenecks.
> The -j6 parallelism level is about where current laptops
> are. Having a big machine doesn't do as much as having fewer,
> but faster cores. 
> 
> We should be able to do far better. I don't know how the Power7
> threads compare in terms of CPU throughput, but going from -j6 to
> -j48 on our 48-core AMD system should easily yield a 6x speed up
> as all are full cores, but we get similarly limited improvements to
> yours, and we get almost perfect scaling in many test suite runs
> that are dominated by compilations.
> 
> The two obvious issues:
>   1. large sequential chains of compiling/running genattrtab followed
>  by compiling insn-attrtab.c and linking the compiler
>   2. repeated serial configure steps
> 
> For 1. we need to somehow split the file up into smaller chunks.
> For 2. we need to have efficient caching.
> 
> Neither is easy...
> 
>   -Geert
> 
> 

Efficient caching of configuration steps would have many more benefits
than just improving parallel builds.  It would directly improve builds
on /all/ systems, and have an even bigger effect on re-builds (re-build
compilation speed can be improved with ccache - but you still need to
wait for the ./configure steps).
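For what it's worth, autoconf already has a per-tree result cache; here is a
minimal sketch of the generic mechanism (not a claim about how the gcc tree
currently uses it):

# --config-cache: the first run fills ./config.cache, later runs reuse
# the cached answers for most checks.
./configure -C
./configure -C
# A cache file can also be shared explicitly between configure runs,
# as long as the compiler and flags stay the same:
./configure --cache-file=/path/to/shared/config.cache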

And of course if someone figures out how to speed up ./configure with
gcc, the same technique could probably be used on thousands of other
programs.

I presume fixing the slow, inefficient serial nature of autotools is a
hard job - otherwise it would have been done already.  But any
improvements there could have big potential benefits.




Re: If you had a month to improve gcc build parallelization, where would you begin?

2013-04-10 Thread Richard Biener
On Wed, Apr 10, 2013 at 4:51 AM, Geert Bosch  wrote:
>
> On Apr 9, 2013, at 22:19, Segher Boessenkool  
> wrote:
>
>> Some numbers, 16-core 64-thread POWER7, c,c++,fortran bootstrap:
>> -j6:  real 57m32.245s
>> -j60: real 38m18.583s
>
> Yes, these confirm mine. It doesn't make sense to look at more
> parallelization before we address the serial bottlenecks.
> The -j6 parallelism level is about where current laptops
> are. Having a big machine doesn't do as much as having fewer,
> but faster cores.

Indeed, having 4-6 cores, with serial stages letting the CPU boost the
frequency of the only active core as much as possible, leads to the
fastest _bootstrap_ times.  Now, testing is a different story here -
and testing dominates my bootstrap / regtest throughput.

So the idea of a combined bootstrap & test make target that would
interleave stage3 target library building, target library testing and
compiler testing (harder due to some target library dependencies
of certain .exp files) would already help (thinking of how libjava
building is the bottleneck for a --enable-languages=all build).

Similarly, allowing multilibs to be built in parallel would help.
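Roughly, in terms of the usual top-level invocations (the combined target
below is purely hypothetical, only to illustrate the idea):

# Today the two phases are strictly serial:
make -j6 bootstrap && make -j6 -k check

# The idea sketched above: let compiler tests that do not depend on
# target libraries start while stage3 libraries are still building.
# (No such target exists today; the name is illustrative only.)
make -j6 bootstrap-and-check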

Richard.


Question about Os, O2 and duplicating code

2013-04-10 Thread Konstantin Vladimirov
Hi,

I have this problem in a private backend, but it is reproducible with
x86 gcc as well, so I suppose it is a core GCC problem.  Let's consider
a simple example:

unsigned int buffer[10];

__attribute__((noinline)) void
myFunc(unsigned int a, unsigned int b, unsigned int c)
{
  unsigned int tmp;
  if( a & 0x2 )
{
  tmp = 0x3221;
  tmp |= (a & 0xF) << 24;
  tmp |= (a & 0x3) << 2;
}
  else
{
  tmp = 0x83621;
  tmp |= (a & 0xF) << 24;
  tmp |= (a & 0x3) << 2;
}
  buffer[0] = tmp;
}

Compiling it with -Os to assembler yields:

movl %edi, %eax
andl $15, %eax
sall $24, %eax
testb $2, %dil
je .L2
andl $3, %edi
sall $2, %edi
orl %edi, %eax
orl $12833, %eax
jmp .L3
.L2:
andl $3, %edi
sall $2, %edi
orl %edi, %eax
orl $538145, %eax
.L3:
movl %eax, buffer(%rip)
ret

There is a big common code fragment:

andl $3, %edi
sall $2, %edi
orl %edi, %eax

That common part could potentially be hoisted above the branch to
reduce code size, but GCC does not do it, not even at -O2.
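Rewriting the source by hand gives the form I would expect the compiler to
reach on its own; a quick sketch (file name arbitrary, same flags as above):

# Hand-hoisted variant: the two subexpressions shared by both branches
# are computed once before the branch; only the constant differs.
cat > hoisted.c <<'EOF'
unsigned int buffer[10];

__attribute__((noinline)) void
myFunc(unsigned int a, unsigned int b, unsigned int c)
{
  unsigned int tmp = (a & 0xF) << 24;   /* common to both branches */
  tmp |= (a & 0x3) << 2;                /* common to both branches */
  tmp |= (a & 0x2) ? 0x3221 : 0x83621;  /* only the constant differs */
  buffer[0] = tmp;
}
EOF
gcc -Os -S hoisted.c -o hoisted.s   # compare against the listing above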

Things are even worse in my backend: my target has a conditional OR, so
all of this code could be linearized (if-converted) with a good
performance impact, but that does not happen, because GCC fails to hoist
the common part out of the if-else before trying conditional execution.

Can someone advise me how to tune the target machine options to signal
that moving common parts out of conditional expressions is profitable?
Or is the only way to write a custom pass?

---
With best regards, Konstantin


Re: Question about Os, O2 and duplicating code

2013-04-10 Thread Richard Biener
On Wed, Apr 10, 2013 at 11:46 AM, Konstantin Vladimirov
 wrote:
> Hi,
>
> I have this problem in a private backend, but it is reproducible with
> x86 gcc as well, so I suppose it is a core GCC problem.  Let's consider
> a simple example:
>
> unsigned int buffer[10];
>
> __attribute__((noinline)) void
> myFunc(unsigned int a, unsigned int b, unsigned int c)
> {
>   unsigned int tmp;
>   if( a & 0x2 )
> {
>   tmp = 0x3221;
>   tmp |= (a & 0xF) << 24;
>   tmp |= (a & 0x3) << 2;
> }
>   else
> {
>   tmp = 0x83621;
>   tmp |= (a & 0xF) << 24;
>   tmp |= (a & 0x3) << 2;
> }
>   buffer[0] = tmp;
> }
>
> Compiling it with -Os to assembler yields:
>
> movl %edi, %eax
> andl $15, %eax
> sall $24, %eax
> testb $2, %dil
> je .L2
> andl $3, %edi
> sall $2, %edi
> orl %edi, %eax
> orl $12833, %eax
> jmp .L3
> .L2:
> andl $3, %edi
> sall $2, %edi
> orl %edi, %eax
> orl $538145, %eax
> .L3:
> movl %eax, buffer(%rip)
> ret
>
> There is a big common code fragment:
>
> andl $3, %edi
> sall $2, %edi
> orl %edi, %eax
>
> That common part could potentially be hoisted above the branch to
> reduce code size, but GCC does not do it, not even at -O2.
>
> Things are even worse in my backend: my target has a conditional OR, so
> all of this code could be linearized (if-converted) with a good
> performance impact, but that does not happen, because GCC fails to hoist
> the common part out of the if-else before trying conditional execution.
>
> Can someone advise me how to tune the target machine options to signal
> that moving common parts out of conditional expressions is profitable?
> Or is the only way to write a custom pass?

Apart from RTL cross-jumping, which is quite limited, there is no
existing pass that does this transformation.  There is PR23286 for
this, which also has some patches.

Richard.

> ---
> With best regards, Konstantin


Re: If you had a month to improve gcc build parallelization, where would you begin?

2013-04-10 Thread Joern Rennecke

Quoting David Brown :


> I presume fixing the slow, inefficient serial nature of autotools is a
> hard job - otherwise it would have been done already.  But any
> improvements there could have big potential benefits.


I think it's more likely because it is tedious and time-consuming to do,
and is not really high enough on any one person's priority list.

The likelihood that a test depends on the outcome of the last few tests
is rather low.  So you could run tests speculatively with an incomplete
set of defines, and then re-run them once you have gathered all the
results from the preceding tests, to verify that you computed the right
result.  Backtrack if necessary.

Also, some parts of the autotools could probably be sped up if you used
compiled C programs instead of shell scripts and m4, particularly if
we add test-result sync / merge tasks as outlined above.  This compiled
autotools machinery might in turn need some old-style autoconf, and only
be suitable for a subset of systems, but with the heavy usage of autotools
that we have today, the build cost would be quickly amortized on a typical
developer's machine.
Reminds me a bit of the evolution of the Usenet news software C-News.


Re: If you had a month to improve gcc build parallelization, where would you begin?

2013-04-10 Thread NightStrike
On Tue, Apr 9, 2013 at 9:50 PM, David Brown  wrote:
> And of course if someone figures out how to speed up ./configure with
> gcc, the same technique could probably be used on thousands of other
> programs.

Is gcc already using a new enough autoconf (2.64+) to take advantage
of shell functions?  That has been a big speedup in general with
configure.  I know it's more of an evolutionary than a revolutionary
change, but every bit helps.
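One quick way to check is to read the version out of the generated script's
header comment, e.g.:

# prints the recorded autoconf version number
sed -n 's/^# Generated by GNU Autoconf \([0-9.]*\).*/\1/p' configure | head -n 1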


Re: If you had a month to improve gcc build parallelization, where would you begin?

2013-04-10 Thread Joel Sherrill

On 4/10/2013 2:52 AM, Richard Biener wrote:

> On Wed, Apr 10, 2013 at 4:51 AM, Geert Bosch  wrote:
>> On Apr 9, 2013, at 22:19, Segher Boessenkool  wrote:
>>> Some numbers, 16-core 64-thread POWER7, c,c++,fortran bootstrap:
>>> -j6:  real 57m32.245s
>>> -j60: real 38m18.583s
>>
>> Yes, these confirm mine. It doesn't make sense to look at more
>> parallelization before we address the serial bottlenecks.
>> The -j6 parallelism level is about where current laptops
>> are. Having a big machine doesn't do as much as having fewer,
>> but faster cores.
>
> Indeed, having 4-6 cores, with serial stages letting the CPU boost the
> frequency of the only active core as much as possible, leads to the
> fastest _bootstrap_ times.  Now, testing is a different story here -
> and testing dominates my bootstrap / regtest throughput.

I have side-stepped the entire -jN issue by using GNU parallel and running
build/test of multiple targets in parallel, with each build/test at -j1.
This doesn't help the per-target build time, but since we build around
15 targets, it is an improvement overall.
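For illustration, the invocation is roughly the following (the script name
and the target list are placeholders for whatever drives one
configure/build/test cycle):

# Each job configures, builds (at -j1) and tests a single target.
parallel -j 8 ./build-one-target.sh {} ::: \
    arm-rtems powerpc-rtems sparc-rtems m68k-rtems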

My rule of thumb is that for builds, 1.5-2 times the number of cores is
usually enough to keep a computer occupied but not overwhelmed. But if
you are also running tests on simulators, the cores are heavily utilized
during the test runs, so use no more than 1.5 times the number of cores.
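In shell terms, something like (nproc is from GNU coreutils; scale the
factor as described above):

make -j"$(( $(nproc) * 3 / 2 ))"    # roughly 1.5 x cores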

> So the idea of a combined bootstrap & test make target that would
> interleave stage3 target library building, target library testing and
> compiler testing (harder due to some target library dependencies
> of certain .exp files) would already help (thinking of how libjava
> building is the bottleneck for a --enable-languages=all build).
>
> Similarly, allowing multilibs to be built in parallel would help.

This is a big one for RTEMS targets. powerpc-rtems takes ~1hr at -j1
to build binutils, gcc/newlib, and gdb on a 2.9 GHz i7 without testing.
Most of our targets are around 25 min.  As the number of multilibs
goes up, there are large variations.

> Richard.



--
Joel Sherrill, Ph.D.             Director of Research & Development
joel.sherr...@oarcorp.com        On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
Support Available                (256) 722-9985



Google Summer of Code Webcast

2013-04-10 Thread Joel Sherrill

Hi

For those interested in participating in the Google Summer of
Code, the Google Open Source Program Office is holding a video
conference using Google Hangouts tomorrow which should
be archived on YouTube afterwards.

https://plus.google.com/events/cpooa4srhdkp7o6ttsu88tcuhl4

This is a good opportunity to learn more about the program
in general.

--
Joel Sherrill, Ph.D.             Director of Research & Development
joel.sherr...@oarcorp.com        On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
Support Available                (256) 722-9985



Re: If you had a month to improve gcc build parallelization, where would you begin?

2013-04-10 Thread Tom Tromey
> "Joern" == Joern Rennecke  writes:

Joern> The likelihood that a test depends on the outcome of the last
Joern> few tests is rather low.  So you could run tests speculatively
Joern> with an incomplete set of defines, and then re-run them once you
Joern> have gathered all the results from the preceding tests, to verify
Joern> that you computed the right result.

I think there are things that can be parallelized without needing to do
any speculation.  For example, AC_CHECK_HEADERS is often invoked with
many header files.  These could in most cases be checked for in
parallel.  Similarly for AC_CHECK_FUNCS.
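As an illustration of why that is safe to parallelize (this is only the
shape of the work, not autoconf's real machinery): each header check boils
down to an independent little compile, so the probes could run concurrently:

for hdr in stdlib.h string.h unistd.h fcntl.h; do
  (
    src="conftest_$$_${hdr%.h}.c"
    printf '#include <%s>\nint main(void) { return 0; }\n' "$hdr" > "$src"
    if cc -c "$src" -o /dev/null 2>/dev/null; then
      echo "checking for $hdr... yes"
    else
      echo "checking for $hdr... no"
    fi
    rm -f "$src"
  ) &
done
wait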

Beyond that you could define a general way to run checks in parallel and
then just change the gcc configure script to use it, using knowledge of
the code to decide what dependencies there are.

Whether or not this would yield a big enough benefit, though ...

I had the vague impression that this was being looked at in upstream
autoconf, but I don't actually follow it any more.

Tom