Re: Threading the compiler
On Nov 10, 2006, at 9:08 PM, Geert Bosch wrote:
> I'd guess we win more by writing object files directly to disk like virtually every other compiler on the planet.

The cost of my assembler is around 1.0% (ppc) to 1.4% (x86) overhead as measured with -pipe -O2 on expr.c. If it was converted, what type of speedup would you expect?

> Most of my compilations (on Linux, at least) use close to 100% of CPU. Adding more overhead for threading and communication/synchronization can only hurt.

Would you notice if the cost were under 0.1%? Would you care?
Threading the compiler
* /From/: Mike Stump * /To/: GCC Development * /Date/: Fri, 10 Nov 2006 12:38:07 -0800 * /Subject/: Threading the compiler

> We're going to have to think seriously about threading the compiler. Intel predicts 80 cores in the near future (5 years). http://hardware.slashdot.org/article.pl?sid=06/09/26/1937237&from=rss To use this many cores for a single compile, we have to find ways to split the work. The best way, of course, is to have make -j80 do that for us; this usually results in excellent efficiencies and an ability to use as many cores as there are jobs to run. However, for the edit, compile, debug cycle of development, utilizing many cores is harder.

You should give make -j80 a try before you dismiss it as not enough. I wrote a paper in 1991, "GNU & You: Building a Better World" (a play on X11's "make world" invocation), describing using massively parallel machines as a compile server and how to write correct parallel Makefiles with GNU make. As you say, you get excellent efficiencies from this.

The edit/compile/debug cycle of development isn't going to benefit appreciably from a multithreaded compiler. You can only edit one file at a time, and the debugging stage isn't going to benefit at all. In general, large programs tend to be split into many source files, where many parallel invocations of gcc work just fine. The actual time to compile a single source module tends to be small.

Before you launch into this idea, you should obtain profile traces that show you have any idle CPU cycles in a particular compilation, cycles that could be profitably used once a thread scheduler enters the picture. Personally, on my dual-core AMD X2, make -j3 works just fine to keep both cores above 98% until the build is done, on projects I currently maintain. Back in 1991, make -j20 worked well enough to keep an 8-processor Alliant FX busy building X11, in about a twelfth of the time it took to build serially, once I'd excised that abominable imake crap and replaced it with pure GNU make. (One hour builds down to about 5 minutes, as I recall.)

Of course, if your program has fewer than 80 source files you may not get 100% utilization out of the machine, but at that point are you really going to care?
-- 
Howard Chu
Chief Architect, Symas Corp.  http://www.symas.com
Director, Highland Sun        http://highlandsun.com/hyc
OpenLDAP Core Team            http://www.openldap.org/project/
Re: strict aliasing question
Howard Chu <[EMAIL PROTECTED]> writes: > That's good to know, thanks. But frankly that's braindead to require > someone to add all these new union declarations all over their code, There is no need for any union trick in your example. Just use a temporary with the correct type, and you have strictly conforming code. Andreas. -- Andreas Schwab, SuSE Labs, [EMAIL PROTECTED] SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different."
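Andreas's suggestion, made concrete: a minimal sketch against Howard's getit() example quoted later in the thread. The stub getit() here is made up for illustration; the point is just that the temporary has the type getit() actually writes through, so no type punning is needed.

    #include <stdio.h>

    static int value = 42;

    /* Stand-in for the library routine in the example; it hands back a
       void* that really points at an int. */
    static void getit(void **arg) { *arg = &value; }

    int main(void) {
        void *bar;                    /* temporary with the correct type */
        getit(&bar);
        printf("foo: %x\n", *(int *)bar);
        return 0;
    }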
Re: Question on tree-nested.c:convert_nl_goto_reference
> I don't know whether there are any functions nested inside nested > functions which do non-local gotos in the Ada testsuite. There aren't now. But there will be plenty when a transformation that changes raises of exceptions to gotos to a visible exception handler is finished. I'll use the C test case you wrote to verify that I'm seeing what you saw. Thanks.
Re: strict aliasing question
Howard Chu wrote:
> extern void getit( void **arg );
>
> main() {
>    union {
>        int *foo;
>        void *bar;
>    } u;
>
>    getit( &u.bar );
>    printf("foo: %x\n", *u.foo);
> }

Rask Ingemann Lambertsen wrote:
> As far as I know, memcpy() is the answer:

You don't need a union or memcpy() to convert the pointer types. You can solve the "void **" aliasing problem with just a cast:

    void *p;
    getit(&p);
    printf("%d\n", *(int *)p);

This assumes that getit() actually writes to an "int" object and returns a "void *" pointer to that object. If it doesn't then you have another aliasing problem to worry about. If it writes to the object using some other known type, then you need two casts to make it safe:

    void *p;
    getit(&p);
    printf("%d\n", (int)*(long *)p);

If it writes to the object using an unknown type then you might be able to use memcpy() to get around the aliasing problem, but this assumes you know that the two types are compatible at the bit level:

    void *p;
    int n;
    getit(&p);
    memcpy(&n, p, sizeof n);
    printf("%d\n", n);

The best solution would be to fix the interface so that it returns the pointer types it actually uses. This would make it typesafe and you wouldn't need to use any casts. If you can't fix the interface itself, the next best thing would be to create your own wrappers which put all the nasty casts in one place:

    int sasl_getprop_str(sasl_conn_t *conn, int prop, char const **pvalue)
    {
        assert(prop == SASL_AUTHUSER || prop == SASL_APPNAME || ...);
        void *tmp;
        int r = sasl_getprop(conn, prop, &tmp);
        if (r == SASL_OK)
            *pvalue = (char const *) tmp;
        return r;
    }

Unfortunately, there are aliasing problems in the Cyrus SASL source that can still come around and bite you once LTO arrives, no matter what you do in your own code. You might want to see if you can't get them to change undefined code like this:

    *(unsigned **)pvalue = &conn->oparams.maxoutbuf;

into code like this:

    *pvalue = (void *) &conn->oparams.maxoutbuf;

Ross Ridge
Re: Threading the compiler
> Let's just say, the CPU is doomed.

So you're building consensus for something that is doomed?

> > Seriously though, I don't really understand what sort of response you're expecting.
>
> Just consensus building.

To build a consensus you have to have something for people to agree or disagree with.

> > Do you have any justification for aiming for 8x parallelism in this release and 2x increase in parallelism in the next release?
>
> Our standard box we ship today that people do compiles on tends to be a 4 way box. If a released compiler made use of the hardware we ship today, it would need to be 4 way. For us to have had the feature in the compiler we ship with those systems, the feature would have had to be in gcc-4.0. Intel has already announced 4 core chips that are pin compatible with the 2 core chips. Their ship date is in 3 days. People have already dropped them in our boxes and they have 8 way machines, today. For them to make use of those cores, today, gcc-4.0 would had to have been 8 way capable. The rate of increase in cores is 2x every 18 months. gcc releases are about one every 12-18 months. By the time I deploy gcc-4.2, I could use 8 way, by the time I stop using gcc-4.2, I could make use of 16-32 cores I suspect. :-(
>
> > Why not just aim for 16x in the first instance?
>
> If 16x is more work than 8x, then I can't yet pony up the work required for 16x myself. If cheap enough, I'll design a system where it is just N-way. Won't know til I start doing code.

4.2 is already frozen for release, and the feature list for 4.3 is pretty much fixed at this point. I wouldn't expect any work of this scale to be released before gcc 4.4. By your own numbers this means you should be aiming for 32x.

> > You mention that "competition is already starting to make progress". Have they found it to be as easy as you imply?
>
> I didn't ask if they found it easy or not.

Do you have any evidence the scheme you're proposing is even feasible?

> > whole-program optimisation and SMP machines have been around for a fair while now, so I'm guessing not.
>
> I don't know of anything that is particularly hard about it, but, if you know of bits that are hard, or have pointers to such, I'd be interested in it.

You imply you're considering backporting this to 4.2. I'd be amazed if that was worthwhile. I'd expect changes to be required in pretty much the whole compiler.

Your strategy is built around the assumption that the majority of the work can be split into multiple independent chunks. There are several fairly obvious places where that is hard, e.g. the frontend probably needs to process the whole file in series because previous declarations affect later code, and inter-procedural optimisations (e.g. inlining) don't lend themselves to splitting on function boundaries. For other optimisations I'm not convinced there's an easy win compared with make -j. You have to make sure those passes don't have any global state, and as other people have pointed out, garbage collection gets messy. The compile server project did something similar, and that seems to have died.

If you're suggesting it's possible to make minor changes to gcc and hide all the threading bits in a "manager" module, then I simply don't believe you. Come back when you have a working prototype.

I don't know how much of the memory allocated is global readonly data (i.e. suitable for sharing between threads). I wouldn't be surprised if it's a relatively small fraction.
If you have answers for the above questions, or some sort of feasibility study, maybe you could publish them? That would give people something to build a consensus on. So far you've given a suggestion of how we might like it to work but no indication of feasibility, level of effort, or where problems are likely to occur. Paul
Re: Threading the compiler
On Sat, Nov 11, 2006 at 04:16:19PM +, Paul Brook wrote: > I don't know how much of the memory allocated is global readonly data (ie. > suitable for sharing between threads). I wouldn't be surprised if it's a > relatively small fraction. I don't have numbers on global readonly, but in typical compilation most of the memory allocated is definitely global. Past a certain point much of that is probably readonly. However, it would take some clever interfaces and discipline to _guarantee_ that any particular global bit was shareable. -- Daniel Jacobowitz CodeSourcery
gcc-4.3-20061111 is now available
Snapshot gcc-4.3-20061111 is now available on ftp://gcc.gnu.org/pub/gcc/snapshots/4.3-20061111/ and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.3 SVN branch with the following options: svn://gcc.gnu.org/svn/gcc/trunk revision 118701

You'll find:

gcc-4.3-20061111.tar.bz2              Complete GCC (includes all of below)
gcc-core-4.3-20061111.tar.bz2         C front end and core compiler
gcc-ada-4.3-20061111.tar.bz2          Ada front end and runtime
gcc-fortran-4.3-20061111.tar.bz2      Fortran front end and runtime
gcc-g++-4.3-20061111.tar.bz2          C++ front end and runtime
gcc-java-4.3-20061111.tar.bz2         Java front end and runtime
gcc-objc-4.3-20061111.tar.bz2         Objective-C front end and runtime
gcc-testsuite-4.3-20061111.tar.bz2    The GCC testsuite

Diffs from 4.3-20061104 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.3 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.
Reducing the size of C++ executables - eliminating malloc
GCC 4.1.1 for PowerPC generates a 162K executable for a minimal program "int main() { return 0; }". GCC 3.4.1 generated a 7.2K executable. Mark Mitchell mentioned the same problem for ARM and proposed a patch to remove the reference to malloc in atexit (http://sourceware.org/ml/newlib/2006/msg00181.html). There are references to malloc in eh_alloc.c and unwind-dw2-fde.c. It looks like these are being included even when there are no exception handlers. Any suggestions on how to eliminate the references to these routines?
-- 
Michael Eager  [EMAIL PROTECTED]  1960 Park Blvd., Palo Alto, CA 94306  650-325-8077
Re: Threading the compiler
> > > whole-program optimisation and SMP machines have been around for a fair while now, so I'm guessing not.
> >
> > I don't know of anything that is particularly hard about it, but, if you know of bits that are hard, or have pointers to such, I'd be interested in it.
>
> You imply you're considering backporting this to 4.2. I'd be amazed if that was worthwhile. I'd expect changes to be required in pretty much the whole compiler. Your strategy is built around the assumption that the majority of the work can be split into multiple independent chunks of work. There are several fairly obvious places where that is hard, e.g. the frontend probably needs to process the whole file in series because previous declarations affect later code. And inter-procedural optimisations (e.g. inlining) don't lend themselves to splitting on function boundaries.

Actually, most IPA optimizations parallelize very well. Pointer analysis and inlining can both be partitioned in ways that let the work be split into threads.

Mike is actually not saying anything that most people around here truly disagree with. We all want to eventually parallelize and distribute GCC optimizations. I just don't think we are at the point where it makes sense to start doing that yet.

Personally, I believe the time to start thinking about parallelizing this stuff is when the problems that make LTO hard (getting rid of all the little niggles like front ends generating RTL, and doing the hard stuff like a middle end type system) are solved. Why? Without solving the problems that make LTO hard, you are going to hit them in trying to make the IPA optimizations (or anything else) parallel, because they are exactly the shared-state-between-functions and global-state problems that GCC has.

In fact, various people (including me) have been discussing how to parallelize and distribute our optimizations for a few months now. So if you really want to help parallelizing along, the thing to do is help LTO right now. I'm happy to commit to parallelizing IPA pointer analysis (which is a ridiculously parallel problem) once the hard LTO problems are solved. Before then, I just think we are going to end up with a bunch of hacks to try to work around our shared state. --Dan
Re: Threading the compiler
Ross Ridge wrote:
> Mike Stump writes:
> > We're going to have to think seriously about threading the compiler. Intel predicts 80 cores in the near future (5 years). [...] To use this many cores for a single compile, we have to find ways to split the work. The best way, of course is to have make -j80 do that for us, this usually results in excellent efficiencies and an ability to use as many cores as there are jobs to run.
>
> Umm... those 80 processors that Intel is talking about are more like the 8 coprocessors in the Cell CPU.

No, the Cell is an asymmetrical (vintage 2000) architecture. Intel & AMD have announced that they are developing large multi-core symmetric processors. The timelines I've seen say that the number of cores on each chip will double every year or two. Moore's law hasn't stopped. The number of gates per chip doubles every 18 months.
-- 
Michael Eager  [EMAIL PROTECTED]  1960 Park Blvd., Palo Alto, CA 94306  650-325-8077
gmp/mpfr and multilib
Does anyone know how the changes for gcc to require gmp/mpfr will affect the multilib builds? In the past, gmp/mpfr in gfortran appeared to only be linked into the compiler itself, so that a 32-bit/64-bit multilib build on Darwin PPC only required gmp/mpfr for 32-bit to be installed. Will any of the libraries in gcc now require gmp/mpfr such that both 32-bit and 64-bit versions of gmp/mpfr must be installed? If that is the case, will the multilib build look for both a lipo 32-bit/64-bit combined shared library in $prefix/lib as well as individual versions in lib and lib64 subdirectories? Jack
Re: Threading the compiler
Mike Stump wrote:
> Thoughts?

Parallelizing GCC is an interesting problem for a couple of reasons: First, the problem is inherently sequential. Second, GCC expects that each step in the process happens in order, one after the other.

Most invocations of GCC are part of a "cluster" of similar invocations. If we look at this cluster, rather than at individual invocations, there may be opportunities for parallelization. Make -j allows several commands to run at the same time. It may be reasonable to incorporate some of the same functionality in the GCC driver, so that it starts processing threads in the background and exits. (There is the interesting question of how threads are re-synced.)

Parsing the source is inherently a sequential operation. I don't think that it is possible to parse different include files independently in separate threads, or even to identify the dependencies between include files. But it may be possible to use the results of parsing an include file (or sequence of include files) from another instance of GCC which is executing in a different process or thread.

Each of the functions in a C/C++ program is dependent on the global environment, but each is independent of the others. Separate threads could process the tree/RTL for each function independently, with the results merged on completion. This may interact adversely with some global optimizations, such as inlining.
-- 
Michael Eager  [EMAIL PROTECTED]  1960 Park Blvd., Palo Alto, CA 94306  650-325-8077
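To make the per-function idea concrete, here is a toy sketch (not GCC code; all names are hypothetical): a fixed pool of worker threads pulls function indices from a shared counter, each "compiles" its function into a private slot, and the results are merged in order once the workers are joined. Build with something like gcc -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_FUNCS   8
    #define NUM_WORKERS 4

    static int results[NUM_FUNCS];
    static int next_func;             /* index of the next function to process */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        int i;
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            i = (next_func < NUM_FUNCS) ? next_func++ : -1;
            pthread_mutex_unlock(&lock);
            if (i < 0)
                break;
            results[i] = i * i;       /* stand-in for optimizing function i */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NUM_WORKERS];
        int i;

        for (i = 0; i < NUM_WORKERS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NUM_WORKERS; i++)
            pthread_join(tid[i], NULL);
        for (i = 0; i < NUM_FUNCS; i++)   /* "merge": emit results in order */
            printf("func %d -> %d\n", i, results[i]);
        return 0;
    }

The global-optimization caveat shows up immediately: anything a worker reads or writes outside its own slot (the analogue of inlining across functions) needs the same kind of locking as next_func.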
Re: Threading the compiler
Geert Bosch wrote:
> Most of my compilations (on Linux, at least) use close to 100% of CPU. Adding more overhead for threading and communication/synchronization can only hurt.

On a single-processor system, adding overhead for multi-threading does reduce performance. On a multi-processor system, the overhead is spread across all processors and is outweighed by running the work in parallel, so there is a net gain. For parallelizable programs, a 4-way processor might achieve a 3X performance improvement.
-- 
Michael Eager  [EMAIL PROTECTED]  1960 Park Blvd., Palo Alto, CA 94306  650-325-8077
Re: strict aliasing question
Howard Chu <[EMAIL PROTECTED]> writes: > Daniel Berlin wrote: > > > > We ask the TBAA analyzer "can a store to a short * touch i. > > In this case, it says "no", because it's not legal. > > > If you know the code is not legal, why don't you abort the compilation > with an error code? It's not actually that easy to detect the undefined cases. Sometimes it's easy, sure. But most times it is not. The compiler does not normally do the sort of analysis which is required. That said, one of my co-workers has developed a patch which detects most aliasing violations, based on the compiler's existing alias analysis. It is able to give warnings for a wide range of cases which the compiler does not currently detect, for a relatively small increase in compilation time. If everything works out right, we'll propose it for gcc 4.3. Ian
Re: gmp/mpfr and multilib
Jack Howarth wrote: Does anyone know how the changes for gcc to require gmp/mpfr will effect the multilib builds? In the past, gmp/mpfr in gfortran appeared to only be linked into the compiler itself so that a 32-bit/64-bit multilib build on Darwin PPC only required gmp/mpfr for 32-bit to be installed. Will any of the libraries in gcc now require gmp/mpfr such that both 32-bit and 64-bit versions of gmp/mpfr must be installed? If that is the case, will the multilib build look for both a lipo 32-bit/64-bit combined shared library in $prefix/lib as well as individual versions in lib and lib64 subdirectories? So far as I know, gmp/mpfr is still only being used for compile-time evaluation of constant expressions (in order to do so in a way that's not dependent on the host's architecture, as it may be different from the target's architecture). I don't believe that there's any intention of using it in a way that would make it useful to link into libraries. - Brooks
Re: Threading the compiler
> Each of the functions in a C/C++ program is dependent on the global environment, but each is independent of the others. Separate threads could process the tree/RTL for each function independently, with the results merged on completion. This may interact adversely with some global optimizations, such as inlining.

Is it just me or could lazy evaluation really help here? OK, maybe it's just me.
Re: Polyhedron performance regression
> Just wanted to note to the list that Tobias spotted a performance regression on Polyhedron ac.
>
> http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-2-0.html

Hum, the performance change on ac is significant. Any way we can get the revision numbers before and after the jump (and before the last jump to zero)? What patches have been committed in that time that could affect this? I can't see anything on the fortran patches... FX
Re: Polyhedron performance regression
On 11/11/06, FX Coudert <[EMAIL PROTECTED]> wrote:
> > Just wanted to note to the list that Tobias spotted a performance regression on Polyhedron ac.
> >
> > http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-2-0.html
>
> Hum, the performance change on ac is significant. Any way we can get the revision numbers before and after the jump (and before the last jump to zero)? What patches have been committed in that time that could affect this? I can't see anything on the fortran patches...

If I had to guess I would say it was the forwprop merge. But I didn't investigate. Richard.
Re: Polyhedron performance regression
On Sat, 11 Nov 2006, FX Coudert wrote:
> > Just wanted to note to the list that Tobias spotted a performance regression on Polyhedron ac.
> >
> > http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-2-0.html
>
> Hum, the performance change on ac is significant. Any way we can get the revision numbers before and after the jump (and before the last jump to zero)? What patches have been committed in that time that could affect this? I can't see anything on the fortran patches...

It must have been between r118372 and r118615. Richard.
Re: Polyhedron performance regression
Richard, If I had to guess I would say it was the forwprop merge... The what? :-) Paul
Re: Polyhedron performance regression
On 11/11/06, Paul Thomas <[EMAIL PROTECTED]> wrote: Richard, > > If I had to guess I would say it was the forwprop merge... The what? :-) fwprop, see http://gcc.gnu.org/ml/gcc-patches/2006-11/msg00141.html If someone can confirm that this patch causes the drop, I can help trying to find a fix. Gr. Steven
-funsafe-math-optimizations and -fno-rounding-math
Hello,

-fno-rounding-math enables the transformation of (-(X - Y)) -> (Y - X) in simplify-rtx.c, which seems to be the same transformation that is enabled by -funsafe-math-optimizations in fold-const.c. If I understand correctly, -frounding-math means that the rounding mode is important. In that case, should there be a correlation between -funsafe-math-optimizations and -fno-rounding-math (which currently does not exist)? Thanks, Revital
Re: Polyhedron performance regression
Steven and Jerry,

> If someone can confirm that this patch causes the drop, I can help trying to find a fix.

amd64/Cygwin_NT
$ /irun/bin/gfortran -O3 -funroll-loops -ffast-math -march=opteron ac.f90

118372   20.2s
118475   20.4s   Bonzini's patch
118704   16.2s

I believe that the improvement is FX's and my patch for MOD. Notice that this is a single-core machine and that there is a PR out on the vectorizer. Could this be the problem, since the suse tests are done on a two-core machine, if I understood correctly? Paul
Re: -funsafe-math-optimizations and -fno-rounding-math
On 11/11/06, Revital1 Eres <[EMAIL PROTECTED]> wrote: Hello, -fno-rounding-math enables the transformation of (-(X - Y)) -> (Y - X) in simplify-rtx.c which seems to be the same transformation that enabled by -funsafe-math-optimizations in fold-const.c. If I understand currently -frounding-math means that the rounding mode is important. In that case should there be correlation between -funsafe-math-optimizations and -fno-rounding-math (which currently does not exist)? I think the simplify-rtx.c code is partly wrong, as it changes behavior with signed zeros. I don't know off-hand if -(X - Y) and Y - X behave the same in rounding if the rounding mode is round to nearest, but certainly for round to +Inf it will differ. So HONOR_SIGNED_ZEROS (mode) && !flag_rounding_math might be the correct predicate here (and in the fold-const.c case). But floating point rounding scares me ;) Richard.
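A small illustration of the rounding-mode concern (the values are made up; assume the program is built so the compiler respects the dynamic rounding mode, e.g. with -frounding-math, and linked with -lm where needed):

    #include <fenv.h>
    #include <stdio.h>

    int main(void)
    {
        volatile double x = 1.0, y = 1e-300;
        double a, b;

        fesetround(FE_UPWARD);      /* round toward +Inf */

        a = -(x - y);               /* x - y rounds up to 1.0, so a == -1.0 */
        b = y - x;                  /* rounds up to the value just above -1.0 */

        printf("%.17g\n%.17g\n", a, b);   /* the two results differ */
        return 0;
    }

Under the default round-to-nearest mode the two expressions agree apart from the sign of zero, which is the separate HONOR_SIGNED_ZEROS concern mentioned above.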
Re: Polyhedron performance regression
On Sat, 11 Nov 2006, Paul Thomas wrote:
> Steven and Jerry,
>
> > If someone can confirm that this patch causes the drop, I can help trying to find a fix.
>
> amd64/Cygwin_NT
> $ /irun/bin/gfortran -O3 -funroll-loops -ffast-math -march=opteron ac.f90
>
> 118372   20.2s
> 118475   20.4s   Bonzini's patch
> 118704   16.2s
>
> I believe that the improvement is FX's and my patch for MOD.

Note that the suse x86_64 tester has 18.6s with the last run before Bonzini's patch and 30.8s with the first run after it, so it regressed quite badly. It also has -ftree-vectorize as an additional option here.

> Notice that this is a single-core machine and that there is a PR out on the vectorizer. Could this be the problem, since the suse tests are done on a two-core machine, if I understood correctly?

Yes, this is a dual-socket machine. But I don't see how this can be an issue here.

Richard.
-- 
Richard Guenther <[EMAIL PROTECTED]> Novell / SUSE Labs
comments on getting the most out of multi-core machines
my 'day job' is a medium sized software operation. we have between 5 and 50 programmers assigned to a given project; and a project is usually a couple of thousand source files (mix of f77, c, c++, ada). all this source gets stuck into between 50 and 100 libraries, and the end result is less than a dozen executables... one of which is much larger than the rest. it is a rare single file that takes more than 30 seconds to compile (at least with gcc3 and higher). linking the largest executable takes about 3 minutes. (sorry to be so long-winded getting to the topic!!)

the ordinary case is i change a file and re-link. takes less than 3.5 minutes. even if gcc was infinitely fast, it would still be 3 minutes. the other case is compiling everything from scratch (which is done regularly). using a tool like SCons which can build a total dependency graph, i have learned that roughly -j100 would be ideal. of course i am stuck with -j4 today. given enough cores to throw the work on, best case is still 3.5 minutes. (of course, this is a simplified analysis)

my point in all of this is that effort at the higher policy levels (by making the build process multi-threaded at the file level) pays off today and for the near future. changing gcc to utilize multi-core systems may be a lot harder and less beneficial than moving up the problem space a notch or two. regards, bud davis
re: comments on getting the most out of multi-core machines
* /From/: Bud Davis * /Date/: Sat, 11 Nov 2006 16:06:44 -0800 (PST)

> it is a rare single file that takes more than 30 seconds to compile (at least with gcc3 and higher). linking the largest executable takes about 3 minutes. (sorry to be so long-winded getting to the topic!!) ordinary case is i change a file and re-link. takes less than 3.5 minutes. even if gcc was infinitely fast, it would still be 3 minutes.

Sounds like you'd be well served by an incremental linker, like AIX provides. But that's mostly a topic for a binutils list. The AIX tools have some good ideas, worth adopting more widely, like recording the sizes of functions in each object file. It basically allows each individual function to be treated separately, the way traditional linkers treat separate .o files. So individual functions can be replaced during a relink, leaving the majority of the object file intact (plus or minus relocations that moved along the way). It also somewhat blurs the distinction between a fully linked executable file and an intermediate relocatable object, since executables can also be incrementally relinked. It's a real timesaver when you just need to fix one file in a very large program.
-- 
Howard Chu
Chief Architect, Symas Corp.  http://www.symas.com
Director, Highland Sun        http://highlandsun.com/hyc
OpenLDAP Core Team            http://www.openldap.org/project/
Re: comments on getting the most out of multi-core machines
Howard Chu wrote: It also somewhat blurs the distinction between a fully linked executable file and an intermediate relocatable object, since executables can also be incrementally relinked. It's a real timesaver when you just need to fix one file in a very large program. a proper fix-and-continue functionality would be even more of a timesaver, maybe that's where the energy should go ...
Re: strict aliasing question
Ian Lance Taylor wrote:
> Howard Chu <[EMAIL PROTECTED]> writes:
> > Daniel Berlin wrote:
> > > We ask the TBAA analyzer "can a store to a short * touch i."
> > > In this case, it says "no", because it's not legal.
> > If you know the code is not legal, why don't you abort the compilation with an error code?
> It's not actually that easy to detect the undefined cases. Sometimes it's easy, sure. But most times it is not. The compiler does not normally do the sort of analysis which is required.

OK, that makes sense too. Dan's statement implied that there was a cut-and-dried test. If the analysis has not occurred, then you obviously cannot know that certain statements can be ignored. You can't even know that they're safe to re-order.

> That said, one of my co-workers has developed a patch which detects most aliasing violations, based on the compiler's existing alias analysis. It is able to give warnings for a wide range of cases which the compiler does not currently detect, for a relatively small increase in compilation time. If everything works out right, we'll propose it for gcc 4.3.

Here's a different example, which produces the weaker warning
warning: type-punning to incomplete type might break strict-aliasing rules

    struct foo;

    int blah(int fd) {
        int buf[BIG_ENOUGH];
        void *v = buf;
        struct foo *f;

        f = v;
        f = (struct foo *)buf;

        init(f, fd);
        munge(f);
        flush(f);
    }

"foo" is an opaque structure. We have no idea what's inside, we just know that it's relatively small. There are allocators available that will malloc them for us, but we don't want to use malloc here because it's too slow, so we want to reserve space for it on the stack, do a few things with it, then forget it.

If we go through the temporary variable v, there's no warning. If we don't use the temporary variable, we get the "might break" message. In this case, nothing in our code will ever dereference the pointer. Why is there any problem here, considering that using the temporary variable accomplishes exactly the same thing, but requires two extra statements?
-- 
Howard Chu
Chief Architect, Symas Corp.  http://www.symas.com
Director, Highland Sun        http://highlandsun.com/hyc
OpenLDAP Core Team            http://www.openldap.org/project/
Re: strict aliasing question
Mike Stump wrote:
> On Nov 10, 2006, at 9:48 AM, Howard Chu wrote:
> > Richard Guenther wrote:
> > > If you compile with -O3 -combine *.c -o alias it will break.
> > Thanks for pointing that out. But that's not a realistic danger for the actual application. The accessor function is always going to be in a library compiled at a separate time. The call will always be from a program built at a separate time, so -combine isn't a factor.
>
> We are building a compiler to outsmart you. We are presently working on technology (google ("LTO")) to break your code. :-) Don't cry when we turn it on by default and it does. I'd recommend understanding the rules and following them.

This raises another interesting point. Aggressive link-time optimization may be nice in a lot of cases, but there are boundaries that should not (and most likely cannot) be crossed. E.g., you shouldn't go peering inside libraries to look behind their exported interfaces. You probably can't, in the case of a shared library, but you probably could for a static library.

This also raises the question of what exactly a static library represents - is it just a group of object files collected together for convenience, as an intermediate step in a large build process, or is it a coherent entity that provides a strictly defined set of services? GNU libtool is known to create "convenience libraries" simply as a means of aggregating object files together, before doing something else with them. (A good argument can be made that this is stupid; they should be using "ld -r" for that purpose, but that's another story...) libtool isn't the only example, either.

As a convenience collection, you should be free to globally optimize across it to your heart's content. But as an actual library, you should stop at its exported interface. How will you distinguish these two cases, when all you see is "foo.a" on the command line?
-- 
Howard Chu
Chief Architect, Symas Corp.  http://www.symas.com
Director, Highland Sun        http://highlandsun.com/hyc
OpenLDAP Core Team            http://www.openldap.org/project/
RE: strict aliasing question
On 12 November 2006 03:35, Howard Chu wrote:
> Here's a different example, which produces the weaker warning
> warning: type-punning to incomplete type might break strict-aliasing rules
>
> struct foo;
>
> int blah(int fd) {
>     int buf[BIG_ENOUGH];
>     void *v = buf;
>     struct foo *f;
>
>     f = v;
>     f = (struct foo *)buf;
>
>     init(f, fd);
>     munge(f);
>     flush(f);
> }
>
> "foo" is an opaque structure. We have no idea what's inside, we just know that it's relatively small. There are allocators available that will malloc them for us, but we don't want to use malloc here because it's too slow, so we want to reserve space for it on the stack, do a few things with it, then forget it.
>
> If we go through the temporary variable v, there's no warning. If we don't use the temporary variable, we get the "might break" message.

Try

    f = (struct foo *)(void *)buf;

Or even better...

    struct foo;

    int blah(int fd) {
        struct foo *f;

        f = alloca (BIG_ENOUGH);
        init(f, fd);
        munge(f);
        flush(f);
    }

cheers, DaveK
-- 
Can't think of a witty .sigline today
Re: Threading the compiler
Ross Ridge wrote:
> Umm... those 80 processors that Intel is talking about are more like the 8 coprocessors in the Cell CPU.

Michael Eager wrote:
> No, the Cell is an asymmetrical (vintage 2000) architecture.

The Cell CPU as a whole is asymmetrical, but I'm only comparing the design to the 8 identical coprocessors (of which only 7 are enabled in the CPU used in the PlayStation 3).

> Intel & AMD have announced that they are developing large multi-core symmetric processors. The timelines I've seen say that the number of cores on each chip will double every year or two.

This doesn't change the fact that SMP systems don't scale well beyond 16 processors or so. To go beyond that you need a different design. Clustering and NUMA have been ways of solving the problem outside the chip. Intel's plan for solving it inside the chip involves giving each of the 80 cores its own 32 MB of SRAM and only connecting each core to its immediate neighbours. This is similar to the Cell SPEs: each has 256K of local memory and they're all connected together in a ring.

> Moore's law hasn't stopped.

While Moore's Law may still be holding on, bus and memory speeds aren't doubling every two years. You can't design an 80-core CPU like a 4-core CPU with 20 times as many cores. Having 80 processors all competing over the same bus for the same memory won't work. Neither will "make -j80". You need to do more than just divide up the work between different processes or threads. You need to divide up the program and data into chunks that will fit into each core's local memory and orchestrate everything so that the data propagates smoothly between cores.

> The number of gates per chip doubles every 18 months.

Actually, it's closer to doubling every 24 months, and Gordon Moore never said it would double every 18 months. Originally, in 1965, he said that the number of components doubled every year; in 1975, after things slowed down, he revised it to doubling every two years.

Ross Ridge
RE: strict aliasing question
On 12 November 2006 04:16, Howard Chu wrote: > Dave Korn wrote: >> On 12 November 2006 03:35, Howard Chu wrote: >> >> >>> If we go through the temporary variable v, there's no warning. If we >>> don't use the temporary variable, we get the "might break" message. >>> >> >> Try >> >> >>> f = (struct foo *)(void *)buf; >>> >> >> > That's good, but why is it safe? Passing through void* means gcc has to assume it could alias anything, IIUIC, as a result of the standard allowing implicit void*<=>T* conversions. cheers, DaveK -- Can't think of a witty .sigline today
Re: strict aliasing question
Howard Chu <[EMAIL PROTECTED]> writes:
> Here's a different example, which produces the weaker warning
> warning: type-punning to incomplete type might break strict-aliasing rules
>
> struct foo;
>
> int blah(int fd) {
>     int buf[BIG_ENOUGH];
>     void *v = buf;
>     struct foo *f;
>
>     f = v;
>     f = (struct foo *)buf;
>
>     init(f, fd);
>     munge(f);
>     flush(f);
> }
>
> "foo" is an opaque structure. We have no idea what's inside, we just know that it's relatively small. There are allocators available that will malloc them for us, but we don't want to use malloc here because it's too slow, so we want to reserve space for it on the stack, do a few things with it, then forget it.
>
> If we go through the temporary variable v, there's no warning. If we don't use the temporary variable, we get the "might break" message. In this case, nothing in our code will ever dereference the pointer. Why is there any problem here, considering that using the temporary variable accomplishes exactly the same thing, but requires two extra statements?

Since you don't do any loads or stores via buf, this code is going to be OK. The warning you get is not all that good since it gives both false positives and (many) false negatives.

Your code will be safe on all counts if you change buf from int[] to char[]. The language standard grants a special exemption to char* pointers. Without that exemption, it would be impossible to write malloc in C.

Ian
Re: strict aliasing question
On Sat, 2006-11-11 at 22:18 -0800, Ian Lance Taylor wrote:
> Your code will be safe on all counts if you change buf from int[] to char[]. The language standard grants a special exemption to char* pointers. Without that exemption, it would be impossible to write malloc in C.

Actually, that is not quite what the C standard allows. What the C standard says is that access via a character type is always valid, as is access via the object's normal (declared) type, and the signed/unsigned versions of both. This means that accessing an element of a character array via any type other than a signed/unsigned character type is undefined. Thanks, Andrew Pinski
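A minimal sketch of the distinction Andrew is drawing (variable names are made up):

    #include <string.h>

    int main(void)
    {
        int  i = 42;
        char c[sizeof(int)];
        char low;

        /* OK: reading an int object through a char* is always allowed. */
        low = *(char *)&i;

        /* Undefined: c is declared as a char array, so reading it through
           an int* violates the aliasing rules even if it happens to be
           suitably aligned:
               int bad = *(int *)c;
         */

        /* Portable alternative: copy the bytes instead of punning. */
        memcpy(c, &i, sizeof(int));

        return (int)low;
    }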
Re: strict aliasing question
Andrew Pinski wrote:
> On Sat, 2006-11-11 at 22:18 -0800, Ian Lance Taylor wrote:
> > Your code will be safe on all counts if you change buf from int[] to char[]. The language standard grants a special exemption to char* pointers. Without that exemption, it would be impossible to write malloc in C.

As I recall, we chose int[] for alignment reasons, figuring we'd have no guarantees on the alignment of a char[].

> Actually, that is not quite what the C standard allows. What the C standard says is that access via a character type is always valid, as is access via the object's normal (declared) type, and the signed/unsigned versions of both. This means that accessing an element of a character array via any type other than a signed/unsigned character type is undefined.

Right, I've just read that text as well, which is why I'm still wondering. But as Ian said, we never do any loads or stores into the actual buf, so it seems we don't need to care whether its value is defined or not. If that's a safe assumption, then I propose that this is a rule worth stating: aliasing means two pointers point to the same memory; if only one pointer is ever used to access that memory, aliasing doesn't matter.
-- 
Howard Chu
Chief Architect, Symas Corp.  http://www.symas.com
Director, Highland Sun        http://highlandsun.com/hyc
OpenLDAP Core Team            http://www.openldap.org/project/
Re: strict aliasing question
On Nov 11, 2006, at 10:45 PM, Howard Chu wrote: Andrew Pinski wrote: On Sat, 2006-11-11 at 22:18 -0800, Ian Lance Taylor wrote: Your code will be safe on all counts if you change buf from int[] to char[]. The language standard grants a special exemption to char* pointers. Without that exemption, it would be impossible to write malloc in C. As I recall, we chose int[] for alignment reasons, figuring we'd have no guarantees on the alignment of a char[]. True, but add __attribute__((aligned(4))) and all is well.
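For concreteness, a sketch of Mike's suggestion applied to Howard's earlier example (BIG_ENOUGH, struct foo and init/munge/flush come from that example; whether 4-byte alignment is sufficient for the real struct foo is an assumption):

    struct foo;

    int blah(int fd) {
        /* Per Ian's suggestion, buf is now char[]; the attribute restores
           the alignment that choosing int[] was meant to provide. */
        char buf[BIG_ENOUGH] __attribute__((aligned(4)));
        void *v = buf;
        struct foo *f = v;

        init(f, fd);
        munge(f);
        flush(f);
        return 0;
    }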