Re: Problem with commas in macro parameters

2005-11-06 Thread Ralf Wildenhues
* Dmitry Yu. Bolkhovityanov wrote on Sun, Nov 06, 2005 at 07:21:56AM CET:
> 
>   That's probably an old problem, but I haven't found any mention of 
> it in the GCC docs. So...

It's one better discussed on the gcc-help mailing list.

>   #define V(value) = value

>   This works fine, until I try to pass it some complex value:
> 
> D int some_array[2] V({4,5})

If you can limit yourself to C99-aware preprocessors, use
  #define V(...) = __VA_ARGS__
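
For example, a minimal sketch assuming a C99-aware preprocessor (the 'D'
macro from the original mail is omitted here):

  #define V(...) = __VA_ARGS__

  int some_array[2] V({4, 5});  /* expands to: int some_array[2] = {4, 5}; */

The comma inside the braces is swallowed by __VA_ARGS__ instead of being
treated as a macro-argument separator.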

Cheers,
Ralf


Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini

Hi,

we have a long-standing issue which we really should solve, one way 
or another: otherwise there are both correctness and performance issues 
which we cannot fix, and new features which we cannot implement. I have 
plenty of examples; just ask if you want more details and motivation.


In a nutshell, the problem is that there is no easy way to use the new 
atomic builtins in the library *headers*: we want the builtins in the 
headers, because we want inlining, but we cannot know whether the 
builtins are actually available. This is of course because of targets 
like i686-* (I think old SPARC is another example), where by default the 
generated code is i386 compatible. In such cases, more generally, the 
availability of the builtins depends on the actual command-line 
switches (e.g., -march).

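As a concrete illustration (a sketch; the i386 behaviour described in the
comments is an assumption based on the problem above, not a statement of
what the compiler currently guarantees):

  // atomic_example.cc
  int add_ref(volatile int* count)
  {
    return __sync_fetch_and_add(count, 1);
  }

  // g++ -march=i686 -c atomic_example.cc
  //   the builtin can be expanded inline (e.g. lock xadd), so calling it
  //   from an inline header function is fine.
  // g++ -march=i386 -c atomic_example.cc
  //   no suitable instruction is available, so the compiler has to fall
  //   back to an out-of-line call, which nothing currently provides.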

Thus my request: would it be possible to have the builtins available 
unconditionally, by way of a slow (lock-based) fallback replacing the real 
implementation when the actual target code doesn't allow for them? I 
think it's rather obvious that this is by far the cleanest solution to 
the issue. Alternately, there is something I could probably implement myself 
rather quickly: export a preprocessor builtin, which could be exploited 
in macros. As I said, I already explored that solution a bit some time ago 
and it seems to me very easy to implement, but I would really prefer the 
first one. However, if you really believe the latter is the only 
possible way to go, I can work on it immediately: the issue is at the top of 
my priorities for the next weeks.


Other proposals?

I'm going to link this message to a new enhancement PR, anyway.

Thanks in advance,
Paolo.


Re: [RFC] Enabling loop unrolls at -O3?

2005-11-06 Thread Richard Guenther
On 11/6/05, Robert Dewar <[EMAIL PROTECTED]> wrote:
> Giovanni Bajo wrote:
>
> > I believe you are missing my point. What is the GCC command line option for
> > "try to optimize as best as you can, please, I don't care compiletime"? I
> > believe that should be -O3. Otherwise let's make -O4. Or -O666. The only 
> > real
> > argument I heard till now is that -funroll-loops isn't valuable without 
> > profile
> > feedback. My experience is that it isn't true, I for sure use it for profit 
> > in
> > my code. But it looks like the only argument that could make a difference is
> > SPEC, and SPEC is not freely available. So I'd love if someone could
> > SPEC -funroll-loops for me.
>
> It is not at all the case that SPEC is the only good argument, in fact
> SPEC on its own is a bad argument. Much more important is impact on real
> life applications, so your data point that it makes a significant difference
> for your application is more interesting than SPEC marks. When it comes
> to GCC, we are more interested in performance than good SPEC figures!

Of course SPEC consists of real-life applications.  Whether it is a good
representation of today's real-life applications is another question, but
it certainly is a very good set of tests.

Richard.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Richard Guenther
On 11/6/05, Paolo Carlini <[EMAIL PROTECTED]> wrote:
> Hi,
>
> we have this long standing issue which really we should solve, one way
> or another: otherwise there are both correctness and performance issues
> which we cannot fix, new features which we cannot implement. I have
> plenty of examples, just ask, in case, if you want more details and
> motivations.
>
> In a nutshell, the problem is that there is no easy way to use in the
> library *headers* the new atomic builtins: we want the builtins in the
> headers, because we want inlining, but we cannot know whether the
> builtins are actually available. This is of course because of targets
> like i686-* (I think old Sparcs is another example), when by default the
> generated code is i386 compatible. In such cases, more generally, the
> availability or not of the builtins depends on the actual command line
> switches (e.g., -march).
>
> Thus my request: would it be possible to have available the builtins
> unconditionally, by way of a slow (locks) fallback replacing the real
> implementation when the actual target code doesn't allow for them? I
> think it's rather obvious that this is the cleanest solution for the
> issue by far. Alternately, something I could probably implement myself
> rather quickly, export a preprocessor builtin, which could be exploited
> in macros. As I said, I already explored a bit this solution time ago
> and seems to me very easy to implement, but really I rather prefer the
> first one. However, if you really believe the latter is the only
> possible way to go, I can work on it immediately: the issue is on top of
> my priorities for the next weeks.
>
> Other proposals?

We could just provide fallback libcalls in libgcc - though usually you
would want a fallback on a higher level to avoid too much overhead.
So - can't you work with some preprocessor magic and a define, if
the builtins are available?

Richard.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini

Richard Guenther wrote:

> We could just provide fallback libcalls in libgcc

Indeed, this is an option. Not one I can implement myself quickly, but I 
think the idea of issuing a library call when the builtin is not 
available was actually meant to enable this kind of solution.

Can you work on it?

> - though usually you would want a fallback on a higher level to avoid
> too much overhead.

Well, we are talking about i386... But yes, that could make sense, for 
example if you want to compile once for both i386 and i686 and obtain 
decent performance everywhere. Consider, however, that currently we end 
up doing a function call anyway, with no inlining, so...

> So - can't you work with some preprocessor magic and a define, if
> the builtins are available?

I don't really understand this last remark of yours: is it an alternate 
solution? Any preprocessor magic has to rely on a new preprocessor 
builtin, because there is no consistent, unequivocal way to know whether 
the builtins are available for the actual target; we already explored 
that some time ago, in our (SUSE) gcc-development list too.


Paolo.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini

Richard Guenther wrote:

> We could just provide fallback libcalls in libgcc

All in all, I think this is really the best solution. For 4.2, Sparc will 
also have the builtins available, and even if we want the libgcc 
code to be equivalent to what is currently available in 
libstdc++-v3/config/cpu, that is, not just locks when possible, the 
amount of code added to libgcc would be very limited. Then, during the 
transition phase we could help ourselves with some macro tricks, but 
only temporarily.


Paolo.


Re: [RFC] Enabling loop unrolls at -O3?

2005-11-06 Thread Gabriel Dos Reis
Robert Dewar <[EMAIL PROTECTED]> writes:

| Steven Bosscher wrote:
| 
| > You must not have been paying attention to one of the most frequent
| > complaints about gcc, which is that it is dog slow already ;-)
| 
| Sure, but to me -O2 says you don't care much about compilation time.

If the Ada front-end wishes, it can make special flags for its own
needs... 

-- Gaby


Re: Is -fvisibility patch possible on GCC 3.3.x

2005-11-06 Thread Gabriel Dos Reis
"Gary M Mann" <[EMAIL PROTECTED]> writes:

| Hi,
| 
| The -fvisibility feature in GCC 4.0 is a really useful way of hiding all
| non-public symbols in a dynamic shared object.
| 
| While I'm aware of a patch which backports this feature to GCC 3.4 (over at
| nedprod.com), I was wondering whether there is a similar patch available for
| GCC 3.3. I'm aware that GCC 3.3 (and some vintages of 3.2) support the
| __attribute__ means of setting a symbol's visibility, but I want to be able
| to change the default visibility for all symbols.
| 
| The problem is that we're stuck for now with the GCC 3.3 compiler as we need
| v5 ABI compatibility for Orbix 6.2 and cannot move to 3.4 until Iona catch
| up.
| 
| Does anyone know if such a patch exists, or even if it is feasible in the
| 3.3 framework?

I'm not aware of any such patch.  However, beware that the visibility
patch comes with its own can of worms.

-- Gaby


Re: [RFC] Enabling loop unrolls at -O3?

2005-11-06 Thread Giovanni Bajo
Gabriel Dos Reis <[EMAIL PROTECTED]> wrote:

>>> You must not have been paying attention to one of the most frequent
>>> complaints about gcc, which is that it is dog slow already ;-)
>>
>> Sure, but to me -O2 says you don't care much about compilation time.
>
> If the Ada front-end wishes, it can make special flags for its own
> needs...


Why are you speaking of the Ada frontend?

If -O1 means "optimize, but be fast", what does -O2 mean? And what does -O3
mean? If -O2 means "the current set of optimizers that we put in -O2", that is
unsatisfying to me.

Giovanni Bajo



Re: [RFC] Enabling loop unrolls at -O3?

2005-11-06 Thread Jakub Jelinek
On Sun, Nov 06, 2005 at 01:32:43PM +0100, Giovanni Bajo wrote:
> If -O1 means "optimize, but be fast", what does -O2 mean? And what does -O3
> mean? If -O2 means "the current set of optimizer that we put in -O2", that's
> unsatisfying for me.

`-O2'
 Optimize even more.  GCC performs nearly all supported
 optimizations that do not involve a space-speed tradeoff.  The
 compiler does not perform loop unrolling or function inlining when
 you specify `-O2'.  As compared to `-O', this option increases
 both compilation time and the performance of the generated code.

Including loop unrolling in -O2 is IMNSHO a bad idea, as loop unrolling
increases code size, sometimes a lot.  And the distinction between -O2
and -O3 is exactly in the space-for-speed tradeoffs.

On many CPUs for many programs, -O3 generates slower code than -O2,
because the cache footprint disadvantages outweigh the positive effects
of the loop unrolling, extra inlining, etc.

Jakub


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Richard Guenther
On 11/6/05, Paolo Carlini <[EMAIL PROTECTED]> wrote:
> Richard Guenther wrote:
> > We could just provide fallback libcalls in libgcc
> Indeed, this is an option. Not one I can implement myself quickly, but I
> think the idea of issuing a library call when the builtin is not
> available was actually meant to enable this kind of solution.
>
> Can you work on it?
> >  - though usually you
> > would want a fallback on a higher level to avoid too much overhead.
> >
> Well, we are talking about i386... But yes, that could make sense, for
> example if you want to compile once both for i386 and i686 and obtain
> decent performance everywhere. Consider, however, that currently we end
> up doing a function call anyway, no inlining, so...
> > So - can't you work with some preprocessor magic and a define, if
> > the builtins are available?
> >
> I don't really understand this last remark of yours: is it an alternate
> solution?!? Any preprocessor magic has to rely on a new preprocessor
> builtin, in case, because there is no consistent, unequivocal way to
> know whether the builtins are available for the  actual target, we
> already explored that some time ago, in our (SUSE) gcc-development list too.

Can you point me to some libstdc++ class/file where you use the
builtins or other solution?  Of course I meant having another builtin
preprocessor macro.  As the builtins are pretty low-level, somebody
might provide a low-level workaround if they're missing, too, instead
of having libcalls to libgcc (like for a freestanding environment like
the kernel).  So ultimately a preprocessor-driven solution may be the
better one.

Richard.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini
Richard Guenther wrote:

>Can you point me to some libstdc++ class/file where you use the
>builtins or other solution?
>
Simply config/cpu/*/atomicity.h will do, for ia64, powerpc, alpha and
s390, currently used to implement __exchange_and_add and __atomic_add. Note
that in this way the latter are *not* inlined, because atomicity.h is
compiled into the *.so :-(

Anyway, as you can see, currently we are using only one builtin,
__sync_fetch_and_add, and only for one type, _Atomic_word, but if we can
rely on the builtins to be available we can call them directly from the
*headers* for a big performance benefit. Also in many other places, of
course, e.g., lock-free tr1::shared_ptr.
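
For reference, a minimal sketch of what such a builtin-based atomicity.h
amounts to (simplified, not the verbatim libstdc++ source; _Atomic_word is
typically just int):

  typedef int _Atomic_word;

  _Atomic_word
  __exchange_and_add(volatile _Atomic_word* __mem, int __val)
  {
    // Compiled once into the shared library, hence never inlined
    // into user code.
    return __sync_fetch_and_add(__mem, __val);
  }

  void
  __atomic_add(volatile _Atomic_word* __mem, int __val)
  {
    __sync_fetch_and_add(__mem, __val);
  }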

I'm still unsure, however, which is the most flexible and clean
strategy: a fallback library-call mechanism could, in principle, easily
provide behavior very similar to the current one when you don't pass
any special -march at compile time, at the cost of worse performance,
but with a single executable for both i386 and i686. I think it is
possible to have libgcc code tailored to the effective target (i386 vs.
i686), right?

If possible, I think the latter is a desirable feature. Maybe there is
something I don't understand about the libgcc-based mechanisms... I'm
eager to learn more...

Paolo.


Re: [RFC] Enabling loop unrolls at -O3?

2005-11-06 Thread Robert Dewar

Richard Guenther wrote:


> Of course SPEC consists of real-life applications.  Whether it is a good
> representation of today's real-life applications is another question, but
> it certainly is a very good set of tests.

It's mostly small programs by its nature; it is not very practical
to include real million-line applications in a test suite like this.
To say it is certainly a very good set of tests is fairly contentious,
I would say, since time and time again the experience is that SPEC
mark values are not indicative of real performance. Now, oddly enough,
that's *least* true of gcc, since compared to other compilers less
effort has been put into tweaking SPEC numbers. But I still maintain
I am more interested in figures from real customer applications than
benchmark suites of this kind. And the usual discovery is that
optimizations are less effective, at least in my experience.



Re: [RFC] Enabling loop unrolls at -O3?

2005-11-06 Thread Robert Dewar

Gabriel Dos Reis wrote:

> Robert Dewar <[EMAIL PROTECTED]> writes:
> 
> | Steven Bosscher wrote:
> | 
> | > You must not have been paying attention to one of the most frequent
> | > complaints about gcc, which is that it is dog slow already ;-)
> | 
> | Sure, but to me -O2 says you don't care much about compilation time.
> 
> If the Ada front-end wishes, it can make special flags for its own
> needs... 


Yes, of course, though there would have to be very good motivation
for this ...




Re: [RFC] Enabling loop unrolls at -O3?

2005-11-06 Thread Robert Dewar

Giovanni Bajo wrote:


> If -O1 means "optimize, but be fast", what does -O2 mean? And what does -O3
> mean? If -O2 means "the current set of optimizers that we put in -O2", that is
> unsatisfying to me.

Right, that's exactly my point: you want some clear statement of the
goals of the different levels; defining -O(n+1) as merely giving better
performance than -On at the expense of worse compile time
is really not sufficient.

With large applications, it is often a big job to test out different
optimization levels, so this is not always the practical way of choosing.



Re: [RFC] Enabling loop unrolls at -O3?

2005-11-06 Thread Robert Dewar

Jakub Jelinek wrote:


> Including loop unrolling in -O2 is IMNSHO a bad idea, as loop unrolling
> increases code size, sometimes a lot.  And the distinction between -O2
> and -O3 is exactly in the space-for-speed tradeoffs.

That's certainly a valid way of defining the difference (and it certainly
used to be the case in the old days, when the principal extra optimization
was inlining).

> On many CPUs for many programs, -O3 generates slower code than -O2,
> because the cache footprint disadvantages outweigh the positive effects
> of the loop unrolling, extra inlining, etc.

That's what we have found, though I would have thought it unusual for
loop unrolling to run into this cache effect in most cases.







Re: [RFC] Enabling loop unrolls at -O3?

2005-11-06 Thread Mattias Engdeg�rd
Robert Dewar <[EMAIL PROTECTED]> writes:

>> Including loop unrolling to -O2 is IMNSHO a bad idea, as loop unrolling
>> increases code size, sometimes a lot.  And the distinction between -O2
>> and -O3 is exactly in the space-for-speed tradeoffs.

>That's certainly a valid way of defining the difference (and certainly
>used to be the case in the old days when the principle extra optimization
>was inlining)

But even -O2 makes several space-for-speed optimisations (multiply by
shifting and adding, align jump targets, etc), so this cannot define
the difference between -O2 and -O3. It is more quantitative in nature:
-O2 only generates bigger code where the payoff in speed is almost
certain, which is not always the case for unrolling/inlining.

Still, more and more projects seem to be switching to -Os for everything,
which could be interpreted both as a consequence of the inevitable
bloating of large programs and as dissatisfaction with what -O2/-O3
does to the code. Or maybe they are wising up and actually using the
option that has been there all along.



Re: [RFC] Enabling loop unrolls at -O3?

2005-11-06 Thread René Rebe
Hi,

On Sunday 06 November 2005 16:20, Mattias Engdegård wrote:
> Robert Dewar <[EMAIL PROTECTED]> writes:
> >> Including loop unrolling to -O2 is IMNSHO a bad idea, as loop unrolling
> >> increases code size, sometimes a lot.  And the distinction between -O2
> >> and -O3 is exactly in the space-for-speed tradeoffs.
> >
> >That's certainly a valid way of defining the difference (and certainly
> >used to be the case in the old days when the principle extra optimization
> >was inlining)
>
> But even -O2 makes several space-for-speed optimisations (multiply by
> shifting and adding, align jump targets, etc), so this cannot define
> the difference between -O2 and -O3. It is more quantitative in nature:
> -O2 only generates bigger code where the payoff in speed is almost
> certain, which is not always the case for unrolling/inlining.
>
> Still, more and more projects seem to be switching to -Os for everything
> which could be interpreted as both a consequence of the inevitable
> bloating of large programs but also as a dissatisfaction with what -O2/3
> does to the code. Or maybe they are wisening up and actually using the
> option that has been there all along.

We used -Os in the past; however, I saw major speed regressions for C++ with -Os 
and 4.0.0. I would need to measure again with 4.0.x or 4.1 to see if that 
changed after the .0 release.

Yours,

-- 
René Rebe - Rubensstr. 64 - 12157 Berlin (Europe / Germany)
http://www.exactcode.de | http://www.t2-project.org
+49 (0)30  255 897 45




Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Howard Hinnant

On Nov 6, 2005, at 6:03 AM, Paolo Carlini wrote:


> > So - can't you work with some preprocessor magic and a define, if
> > the builtins are available?
>
> I don't really understand this last remark of yours: is it an
> alternate solution? Any preprocessor magic has to rely on a new
> preprocessor builtin, because there is no consistent, unequivocal way
> to know whether the builtins are available for the actual target; we
> already explored that some time ago, in our (SUSE) gcc-development
> list too.


Coincidentally I also explored this option in another product.  We  
ended up implementing it and it seemed to work quite well.  It did  
require the back end to "register" with the preprocessor those  
builtins it implemented, and quite frankly I don't know exactly how  
that registration worked.  But from a library writer's point of view,  
it was pretty cool.  For example:


inline
unsigned
count_bits32(_CSTD::uint32_t x)
{
#if __has_intrinsic(__builtin_count_bits32)
return __builtin_count_bits32(x);
#else
x -= (x >> 1) & 0x55555555;
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x + (x >> 4)) & 0x0F0F0F0F;
x += x >> 8;
x += x >> 16;
return (unsigned)x & 0xFF;
#endif
}

In general, if __builtin_XYZ is implemented in the BE, then  
__has_intrinsic(__builtin_XYZ) answers true, else it answers false.   
This offers a lot of generality to the library writer.  Generic  
implementations are written once and for all (in the library), and  
need not be inlined.  Furthermore, the library can test  
__has_intrinsic(__builtin_XYZ) in places relatively unrelated to   
__builtin_XYZ (such as a client of __builtin_XYZ) and perform  
different algorithms in the client depending on whether the builtin  
existed or not (clients of the atomic builtins might especially value  
such flexibility).


Also during testing the "macro-ness" of this facility can come in  
really handy:


// Test generic library "builtins"
#undef __has_intrinsic
#define __has_intrinsic(x) 0

// All __builtins are effectively disabled here,
// except of course those not protected by __has_intrinsic.

The work in the back end seems somewhat eased as well as the back end  
can completely ignore the builtin unless it wants to implement it.   
And then there is only the extra "registration" with __has_intrinsic  
to deal with.


My apologies for rehashing an old argument.  If there are pointers to:

> we already explored that some time ago, in our (SUSE) gcc-development
> list too.


I would certainly take the time to review that thread.

-Howard



Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Mark Mitchell
Paolo Carlini wrote:
> Hi,
> 
> we have this long standing issue which really we should solve, one way
> or another: otherwise there are both correctness and performance issues
> which we cannot fix, new features which we cannot implement. I have
> plenty of examples, just ask, in case, if you want more details and
> motivations.

I think this is a somewhat difficult problem because of the tension
between performance and functionality.  In particular, as you say, the
code sequence you want to use varies by CPU.

I don't think I have good answers; this email is just me musing out loud.

You probably don't want to inline the assembly code equivalent of:

  if (cpu == i386) ...
  else if (cpu == i486) ...
  else if (cpu == i586) ...
  ...

On the other hand, if you inline, say, the i486 variant, and then run on
a i686, you may not get very good performance.

So, the important thing is to weigh the cost of a function call plus
run-time conditionals (when using a libgcc routine that would contain
support for all the CPUs) against the benefit of getting the fastest
code sequences on the current processors.

And in a workstation distribution you may be concerned about supporting
multiple CPUs; if you're building for a specific hardware board, then
you only care about the CPU actually on that board.

What do you propose that the libgcc routine do for a CPU that cannot
support the builtin at all?  Just do a trivial implementation that is
safe only for a single-CPU, single-threaded system?

I think that to satisfy everyone, you may need a configure option to
decide between inlining support for a particular processor (for maximum
performance when you know the target processor) and making a library
call (when you don't).
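
A hypothetical spelling of such an option, just to make the idea concrete
(neither flag exists today):

  ./configure --with-atomic-builtins=cpu      # inline for the configured CPU
  ./configure --with-atomic-builtins=libcall  # always go through the library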

-- 
Mark Mitchell
CodeSourcery, LLC
[EMAIL PROTECTED]
(916) 791-8304


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini
Hi Howard,

> Coincidentally I also explored this option in another product.  We 
> ended up implementing it and it seemed to work quite well.  It did 
> require the back end to "register" with the preprocessor those 
> builtins it implemented, and quite frankly I don't know exactly how 
> that registration worked.  But from a library writer's point of view, 
> it was pretty cool.  For example:
>
> inline
> unsigned
> count_bits32(_CSTD::uint32_t x)
> {
> #if __has_intrinsic(__builtin_count_bits32)
> return __builtin_count_bits32(x);
> #else
> x -= (x >> 1) & 0x55555555;
> x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
> x = (x + (x >> 4)) & 0x0F0F0F0F;
> x += x >> 8;
> x += x >> 16;
> return (unsigned)x & 0xFF;
> #endif
> }
>
> In general, if __builtin_XYZ is implemented in the BE, then 
> __has_intrinsic(__builtin_XYZ) answers true, else it answers false.  
> This offers a lot of generality to the library writer.  Generic 
> implementations are written once and for all (in the library), and 
> need not be inlined.

Indeed, we are striving for generality. The mechanism that you suggest
would be rather easy to implement, in principle: as I wrote earlier
today, it's pretty simple to extract the info and prepare a new
preprocessor builtin that tells you for sure whether that specific
target has the atomics implemented or not. Later I can also tell you the
exact files and functions which would likely be touched; I just have
to dig a bit in my disk ;)

Then, however, where to put the fallback *assembly* for each
*target-specific* atomic builtin? The most natural choice seems to me
libgcc, because our builtin infrastructure *automatically* issues
library calls when the builtin is not available. That
solution would avoid once and for all playing tricks with macros like
the above, IMHO. And it is also flexible for many different targets, some
even requiring different sub-variants (the obnoxious i386 vs. i686...).

To repeat my "philosophy": the idea of having compiler builtins is good,
very good, but then we should, IMHO, complete the offloading of this
low-level issue to the compiler, letting the compiler (plus libgcc, if
need be) take care of generating code for __exchange_and_add, or whatever.

Paolo.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Richard Henderson
On Sun, Nov 06, 2005 at 11:34:30AM +0100, Paolo Carlini wrote:
> Thus my request: would it be possible to have available the builtins 
> unconditionally, by way of a slow (locks) fallback replacing the real 
> implementation when the actual target code doesn't allow for them?

I suppose that in some cases it would be possible to implement
them in libgcc.  Certainly we provided for that possibility 
by expanding to external calls.

Not all targets are going to be able to implement the builtins,
even with locks.  It is imperative that the target have an 
atomic store operation, so that other read-only references to
the variable see either the old or new value, but not a mix.


r~


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini
Hi Mark,

>I think this is a somewhat difficult problem because of the tension
>between performance and functionality.  In particular, as you say, the
>code sequence you want to use varies by CPU.
>
>I don't think I have good answers; this email is just me musing out loud.
>
>You probably don't want to inline the assembly code equivalent of:
>
>  if (cpu == i386) ...
>  else if (cpu == i486) ...
>  else if (cpu == i586) ...
>  ...
>
>On the other hand, if you inline, say, the i486 variant, and then run on
>a i686, you may not get very good performance.
>
>So, the important thing is to weigh the cost of a function call plus
>run-time conditionals (when using a libgcc routine that would contain
>support for all the CPUs) against the benefit of getting the fastest
>code sequences on the current processors.
>  
>
Actually, the situation is not as bad as it may seem, as far as I can see:
the worst case is i386 vs. i486+, and old SPARC vs. new SPARC. More
generally, a target either cannot implement the builtin at all (a trivial
fallback using locks, or no MT support at all) or can do so in no more than
one non-trivial way. Then libgcc would contain at most two versions: the
trivial one, and another piece of assembly, absolutely identical in
principle to what the builtin is expanded to when the inline version
is actually desired.

>And in a workstation distribution you may be concerned about supporting
>multiple CPUs; if you're building for a specific hardware board, then
>you only care about the CPU actually on that board.
>
>What do you propose that the libgcc routine do for a CPU that cannot
>support the builtin at all?  Just do a trivial implementation that is
>safe only for a single-CPU, single-threaded system?
>  
>
Either that or a very low-performance one, using locks. The issue is
still open; we can resolve it rather easily, I think.

>I think that to satisfy everyone, you may need a configure option to
>decide between inlining support for a particular processor (for maximum
>performance when you know the target performance) and making a library
>call (when you don't).
>  
>
Yes, let's consider for simplicity the obnoxious i686 case: if the user
doesn't pass any -march, then the lock-based fallback is picked from
libgcc, or the non-trivial implementation if the specific target (i486+)
supports it; if the user passes -march=i486+, then the builtin is
expanded inline by the compiler, with no use of libgcc at all. Similarly
for Sparc.

Paolo.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Mark Mitchell
Paolo Carlini wrote:

> Actually, the situation is not as bad, as far as I can see: the worst
> case is i386 vs i486+, and Old-Sparc vs New-Sparc. More generally, a
> > target either cannot implement the builtin at all (a trivial fall back
> using locks or no MT support at all) or can in no more than 1
> non-trivial way. 

Are you saying that you don't expect there to ever be an architecture
that might have three or more ways of doing locking?  That seems rather
optimistic to me.  I think we ought to plan for needing as many versions
as we have CPUs, roughly speaking.

As for the specific IA32 case, I think you're suggesting that we ought
to inline the sequence if (at least) -march=i486 is passed. If we
currently use the same sequences for all i486 and higher processors,
then that's a fine idea; there's not much point in making a library call
to a function that will just do that one sequence, with the only benefit
being that someone might later be able to replace that function if some
as-yet-unknown IA32 processor needs a different sequence.

-- 
Mark Mitchell
CodeSourcery, LLC
[EMAIL PROTECTED]
(916) 791-8304


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Richard Guenther
On 11/6/05, Mark Mitchell <[EMAIL PROTECTED]> wrote:
> Paolo Carlini wrote:
>
> > Actually, the situation is not as bad, as far as I can see: the worst
> > case is i386 vs i486+, and Old-Sparc vs New-Sparc. More generally, a
> > target either cannot implement the builtin at all (a trivial fall back
> > using locks or no MT support at all) or can in no more than 1
> > non-trivial way.
>
> Are you saying that you don't expect there to ever be an architecture
> that might have three or more ways of doing locking?  That seems rather
> optimistic to me.  I think we ought to plan for needing as many versions
> as we have CPUs, roughly speaking.
>
> As for the specific IA32 case, I think you're suggesting that we ought
> to inline the sequence if (at least) -march=i486 is passed. If we
> currently use the same sequences for all i486 and higher processors,
> then that's a fine idea; there's not much point in making a library call
> to a function that will just do that one sequence, with the only benefit
> being that someone might later be able to replace that function if some
> as-yet-unknown IA32 processor needs a different sequence.

Just to put some more thoughts on the table, I'm about to propose adding
a __gcc_cpu_feature symbol to $suitable_place, similar to what Intel is
doing with its __intel_cpu_indicator which is used in their runtime libraries
to select different code paths based on processor capabilities.  I'm not
yet sure if and how to best expose this to the user, but internally this could
be mapped 1:1 to what we have in the TARGET_* flags.

Richard.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini
Mark Mitchell wrote:

>Paolo Carlini wrote:
>
>>Actually, the situation is not as bad, as far as I can see: the worst
>>case is i386 vs i486+, and Old-Sparc vs New-Sparc. More generally, a
>>target either cannot implement the builtin at all (a trivial fall back
>>using locks or no MT support at all) or can in no more than 1
>>non-trivial way. 
>>
>>
>Are you saying that you don't expect there to ever be an architecture
>that might have three or more ways of doing locking?  That seems rather
>optimistic to me.  I think we ought to plan for needing as many versions
>as we have CPUs, roughly speaking.
>  
>
Yes, in principle you are right, but in that case we can reorder the
ifs: first i686, last i386 ;) Seriously, earlier today I was hoping we
could have something smarter than a series of conditionals at the level
of libgcc (I don't know it well). I was hoping we could manage to install
a version of it "knowing the target", so to speak.

>As for the specific IA32 case, I think you're suggesting that we ought
>to inline the sequence if (at least) -march=i486 is passed. If we
>currently use the same sequences for all i486 and higher processors,
>then that's a fine idea; there's not much point in making a library call
>to a function that will just do that one sequence, with the only benefit
>being that someone might later be able to replace that function if some
>as-yet-unknown IA32 processor needs a different sequence.
>
Yes. My point, more precisely, is that, when -march=i486+ is passed,
*nothing* would change compared to what happens *now*. Again, I don't
know the details of Richard's implementation, but the real new things
that we are discussing today kick in when i386 (the default) is in effect.

Paolo.



Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Mark Mitchell
Paolo Carlini wrote:

> Yes, in principle you are right, but in that case we can reorder the
> ifs: first i686, last i386 ;) Seriously earlier today I was hoping we
> can have something smarter than a series of conditionals at the level of
> libgcc, I don't know it much. I was hoping we can manage to install a
> version of it "knowing the target", so to speak.

Yes, GLIBC does that kind of thing, and we could too.  In the simplest
form, we could have startup code that checks the CPU, and sets up a
table of function pointers that application code could use.

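A minimal sketch of that simplistic form (purely illustrative; the CPU
probe below is a placeholder, not an existing API):

  #include <pthread.h>

  typedef int (*fetch_add_fn)(volatile int*, int);

  static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

  // Fast path: requires a CPU with the necessary atomic instruction.
  static int fetch_add_atomic(volatile int* p, int v)
  {
    return __sync_fetch_and_add(p, v);
  }

  // Slow path: mutex-protected read-modify-write.
  static int fetch_add_mutex(volatile int* p, int v)
  {
    pthread_mutex_lock(&fallback_lock);
    int old = *p;
    *p = old + v;
    pthread_mutex_unlock(&fallback_lock);
    return old;
  }

  static int cpu_supports_atomics(void) { return 1; }  // placeholder probe

  static fetch_add_fn fetch_add_impl;

  __attribute__((constructor))
  static void init_atomics(void)
  {
    // Runs at startup: pick one implementation for the whole process.
    fetch_add_impl = cpu_supports_atomics() ? fetch_add_atomic
                                            : fetch_add_mutex;
  }
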
In principle, at least, this would be useful for other things, too; we
might want different versions of integer division routines for ARM, or
maybe, even, use real FP instructions in our software floating-point
routines, if we happened to be on a hardware floating-point processor.

-- 
Mark Mitchell
CodeSourcery, LLC
[EMAIL PROTECTED]
(916) 791-8304


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Richard Henderson
On Sun, Nov 06, 2005 at 11:02:29AM -0800, Mark Mitchell wrote:
> Are you saying that you don't expect there to ever be an architecture
> that might have three or more ways of doing locking?  That seems rather
> optimistic to me.  I think we ought to plan for needing as many versions
> as we have CPUs, roughly speaking.

I think this is overkill.

> If we currently use the same sequences for all i486 and higher processors,
> then that's a fine idea;

This is pretty much true.

To keep all this in perspective, folks should remember that atomic
operations are *slow*.  Very very slow.  Orders of magnitude slower
than function calls.  Seriously.  Taking p4 as the extreme example,
one can expect a null function call in around 10 cycles, but a locked
memory operation to take 1000.  Usually things aren't that bad, but
I believe some poor design decisions were made for p4 here.  But even
on a platform without such problems you can expect a factor of 30
difference.


r~


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini
Mark Mitchell wrote:

>>Yes, in principle you are right, but in that case we can reorder the
>>ifs: first i686, last i386 ;) Seriously earlier today I was hoping we
>>can have something smarter than a series of conditionals at the level of
>>libgcc, I don't know it much. I was hoping we can manage to install a
>>version of it "knowing the target", so to speak.
>>
>>
>Yes, GLIBC does that kind of thing, and we could do.  In the simplest
>form, we could have startup code that checks the CPU, and sets up a
>table of function pointers that application code could use.
>
>In principle, at least, this would be useful for other things, too; we
>might want different versions of integer division routines for ARM, or
>maybe, even, use real FP instructions in our software floating-point
>routines, if we happened to be on a hardware floating-point processor.
>
Yeah

Paolo.



Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Mark Mitchell
Richard Henderson wrote:

> I believe some poor design decisions were made for p4 here.  But even
> on a platform without such problems you can expect a factor of 30
> difference.

So, that suggests that inlining these operations probably isn't very
profitable.  In that case, it seems like we could put these routines
into libgcc, and just have libstdc++ call them.  And, that if
__exchange_and_add is showing up on the top of the profile, the fix
probably isn't inlining -- it's to work out a way to make less use of
atomic operations.

-- 
Mark Mitchell
CodeSourcery, LLC
[EMAIL PROTECTED]
(916) 791-8304


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini
Mark Mitchell wrote:

>Richard Henderson wrote:
>
>>I believe some poor design decisions were made for p4 here.  But even
>>on a platform without such problems you can expect a factor of 30
>>difference.
>>
>>
>So, that suggests that inlining these operations probably isn't very
>profitable.  In that case, it seems like we could put these routines
>into libgcc, and just have libstdc++ call them.  And, that if
>__exchange_and_add is showing up on the top of the profile, the fix
>probably isn't inlining -- it's to work out a way to make less use of
>atomic operations.
>  
>
Indeed, I was about to reply to Richard with the very same points. If we
are really sure that there is not much to gain from inlining (*), then
the libgcc idea is still valid, even more so, and in a sense I care a lot
about: working on the library will be *so* nice and the code so *clean*!

Paolo.

(*) You may dig some numbers from this thread:

http://gcc.gnu.org/ml/libstdc++/2004-02/msg00372.html

where I had to agree that we didn't give away too much performance.
Still, it's measurable.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini
Paolo Carlini wrote:

>>And, that if
>>__exchange_and_add is showing up on the top of the profile, the fix
>>probably isn't inlining -- it's to work out a way to make less use of
>>atomic operations.
>>
>>
I want to add that we are certainly going in this direction, with a
non-refcounted string for the next library ABI, also available as an
extension, a preview for 4.1.

On the other hand, you cannot avoid atomic operations completely in the
library, and in some areas, for instance tr1::shared_ptr, they are really
the best tool, via lock-free programming, to achieve MT-safety
and good performance.

Also, I think a user yesterday had a good point: we should also provide
a configuration option to build the library optimized
for a single thread. To be fair, Gerald Pfeifer also suggested that a
long time ago...

Paolo.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Ian Lance Taylor
Richard Henderson <[EMAIL PROTECTED]> writes:

> Not all targets are going to be able to implement the builtins,
> even with locks.  It is imperative that the target have an 
> atomic store operation, so that other read-only references to
> the variable see either the old or new value, but not a mix.

How many processors out there support multi-processor systems but do
not provide any sort of atomic store operation?

On a uni-processor system, presumably we only have to worry about
preemptive thread switching and signals.  Appropriate locking
mechanisms are available for both.

I can see that there is a troubling case that code may be compiled for
i386 and then run on a multi-processing system using newer processors.
That is something which we would have to detect at run time, in start
up code or the first time the builtins are invoked.

But it's not like I've looked at this stuff all that much.  In
general, what is the worst case we have to worry about?

Ian


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini
Hi Ian,

>I can see that there is a troubling case that code may be compiled for
>i386 and then run on a multi-processing system using newer processors.
>That is something which we would have to detect at run time, in start
>up code or the first time the builtins are invoked.
>  
>
Earlier in this thread, Mark figured out a very nice solution for this
problem: a target-aware libgcc. Apparently glibc is already using a
similar approach.

Paolo.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Florian Weimer
* Paolo Carlini:

> Actually, the situation is not as bad, as far as I can see: the worst
> case is i386 vs i486+, and Old-Sparc vs New-Sparc. More generally, a
> target either cannot implement the builtin at all (a trivial fall back
> using locks or no MT support at all) or can in no more than 1
> non-trivial way.

Isn't IA32 a counter-example to this?  Non-SMP or non-threaded
(normal INC/DEC), SMP (LOCK INC/DEC), and modern SMP (something else)
already make for three choices.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Peter Dimov

Richard Henderson wrote:


> To keep all this in perspective, folks should remember that atomic
> operations are *slow*.  Very very slow.  Orders of magnitude slower
> than function calls.  Seriously.  Taking p4 as the extreme example,
> one can expect a null function call in around 10 cycles, but a locked
> memory operation to take 1000.  Usually things aren't that bad, but
> I believe some poor design decisions were made for p4 here.  But even
> on a platform without such problems you can expect a factor of 30
> difference.


Apologies in advance if the following is not relevant...

Even on a P4, inlining may enable compiler optimizations. One case is when 
the compiler can see that the return value of __sync_fetch_and_or (for 
instance) isn't used. It's possible to use a wait-free "lock or" instead of 
a "lock cmpxchg" loop (MSVC 8 does this for _InterlockedOr.)


Another case is when inlining results in a sequence of K adjacent 
__sync_fetch_and_add( &x, 1 ) operations. These can legally be replaced with 
a single __sync_fetch_and_add.

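A hedged illustration of that second point (no claim is made here that any
given compiler performs the merge; the point is only that the transformation
is legal, and only possible when the builtins are visible to the optimizer,
i.e. inlined rather than hidden behind an opaque call):

  void bump_twice(volatile int* counter)
  {
    __sync_fetch_and_add(counter, 1);
    __sync_fetch_and_add(counter, 1);
  }

  // ...may legally become, when neither intermediate value is used:
  void bump_twice_merged(volatile int* counter)
  {
    __sync_fetch_and_add(counter, 2);
  }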

Currently the __sync_* intrinsics seem to be fully locked, but if 
acquire/release/unordered variants are added, other platforms may also 
suffer from lack of inlining. On a PowerPC an unordered atomic increment is 
pretty much the same speed as an ordinary increment (when there is no 
contention.) 



Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Florian Weimer
* Richard Henderson:

> To keep all this in perspective, folks should remember that atomic
> operations are *slow*.  Very very slow.  Orders of magnitude slower
> than function calls.  Seriously.  Taking p4 as the extreme example,
> one can expect a null function call in around 10 cycles, but a locked
> memory operation to take 1000.  Usually things aren't that bad, but
> I believe some poor design decisions were made for p4 here.  But even
> on a platform without such problems you can expect a factor of 30
> difference.

And, as far as I know, you take this performance hit even if you
aren't running SMP and could use an ordinary read-modify-write
instruction instead.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Richard Henderson
On Sun, Nov 06, 2005 at 12:10:03PM -0800, Ian Lance Taylor wrote:
> How many processors out there support multi-processor systems but do
> not provide any sort of atomic store operation?

My point here had been more wrt the 8-byte operations, wherein
there are *plenty* of multi-processor systems that don't have
an 8-byte atomic store.

Although in some cases we can make use of the fpu for that.


r~


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Florian Weimer
* Peter Dimov:

> Even on a P4, inlining may enable compiler optimizations. One case is when 
> the compiler can see that the return value of __sync_fetch_and_or (for 
> instance) isn't used. It's possible to use a wait-free "lock or" instead of 
> a "lock cmpxchg" loop (MSVC 8 does this for _InterlockedOr.)

You don't need inlining to optimize these cases.  You only need to
know precisely what the library implementations do, and you need a
couple of choices.  GCC can already optimizes

  printf ("hello world\n");

to

  puts ("hello world");

even though no inlining takes place.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini

Richard Henderson wrote:

> On Sun, Nov 06, 2005 at 12:10:03PM -0800, Ian Lance Taylor wrote:
> > How many processors out there support multi-processor systems but do
> > not provide any sort of atomic store operation?
>
> My point here had been more wrt the 8-byte operations, wherein
> there are *plenty* of multi-processor systems that don't have
> an 8-byte atomic store.
>
> Although in some cases we can make use of the fpu for that.
Indeed, I hope we can deal with that, at least on most 64-bit machines: 
we have got a nasty bug in the mt_allocator code (libstdc++/24469) which 
would be trivial to fix if size_t-sized atomics were generally available...


Paolo.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Ulrich Drepper
Mark Mitchell wrote:
> Yes, GLIBC does that kind of thing, and we could do.  In the simplest
> form, we could have startup code that checks the CPU, and sets up a
> table of function pointers that application code could use.

That's not what glibc does and it is a horrible idea.  The indirect
jumps are costly, very much so.  The longer the pipeline the worse.

The best solution (for Linux) is to compile multiple versions of the DSO
and place them in the correct places so that the dynamic linker finds
them if the system has the right functionality.  Code generation issues
aside, this is really only needed for atomic ops and maybe vector
operations (XMM, Altivec, ...).  The number of configurations really
needed and important is small (as is the libstdc++ binary).

So, just make sure that an appropriate configure line can be given
(i.e., add --enable-xmm flags or so) and make it possible to compile
multiple libstdc++ without recompiling the whole gcc.  Packagers can
then compile packages with multiple libstdc++.

Pushing some of the operations (atomic ops) to libgcc seems sensible but
others (like vector ops) are libstdc++ specific and therefore splitting
the functionality between libstdc++ and libgcc means requiring even more
versions since then libgcc also would have to be available in multiple
versions.


And note that for Linux the atomic ops need to take arch-specific
extensions into account.  For ppc it'll likely mean using the vDSO.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Mark Mitchell
Ulrich Drepper wrote:
> Mark Mitchell wrote:
> 
>>Yes, GLIBC does that kind of thing, and we could do.  In the simplest
>>form, we could have startup code that checks the CPU, and sets up a
>>table of function pointers that application code could use.
> 
> 
> That's not what glibc does and it is a horrible idea.  The indirect
> jumps are costly, very much so.  The longer the pipeline the worse.

I didn't mean to imply that GLIBC uses the simplistic solution I
suggested, and, certainly, dynamic linking is going to be better on
systems that support it.  The simplistic solution was meant to be
illustrative, and might be appropriate for use on systems without
dynamic linking, like bare-metal embedded configurations, where,
however, the exact CPU isn't known at link time.

-- 
Mark Mitchell
CodeSourcery, LLC
[EMAIL PROTECTED]
(916) 791-8304


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Richard Henderson
On Sun, Nov 06, 2005 at 10:51:51AM -0800, Richard Henderson wrote:
> I suppose that in some cases it would be possible to implement
> them in libgcc.  Certainly we provided for that possibility 
> by expanding to external calls.

Actually, no, it's not possible.  At least in the context we're
discussing here.  Consider:

One part of the application (say, libstdc++) is compiled with only
i386 support.  Here we wind up relying on a mutex to protect the
memory update.  Another part of the application (say, the exe) is
compiled with i686 support, and so chooses to use atomic operations.
The application will now fail because not all updates to the memory
location are protected by the mutex.

The use of a mutex (or not) is part of the API associated with the
memory location.  If it's ever used by one part of the application,
it must be used by all.

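A minimal sketch of that failure mode (hypothetical translation units; the
function names are illustrative only):

  #include <pthread.h>

  static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

  // Built for i386: the "atomic" add is really a mutex-protected
  // read-modify-write.
  int add_i386_style(volatile int* p, int v)
  {
    pthread_mutex_lock(&m);
    int old = *p;
    *p = old + v;          // an i686-style update racing here is lost
    pthread_mutex_unlock(&m);
    return old;
  }

  // Built with -march=i686: a single locked instruction that never takes
  // the mutex, so the two update protocols do not compose.
  int add_i686_style(volatile int* p, int v)
  {
    return __sync_fetch_and_add(p, v);
  }
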
The only thing we can do for libstdc++ is to put something in the 
config files that claims that atomic operations are *always*
available for all members of the architecture family.  If so, we
may inline them.  Otherwise, we should force the routines to be
out-of-line in the library.  This provides the stable ABI that we
require.  We may then use optimization flags to either implement
the operation via atomic operations, or a mutex.

We cannot provide fallbacks for the sync builtins in libgcc, because
at that level we do not have enough information about the API required
by the user.


r~


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini
Richard Henderson wrote:

>One part of the application (say, libstdc++) is compiled with only
>i386 support.  Here we wind up relying on a mutex to protect the
>memory update.  Another part of the application (say, the exe) is
>compiled with i686 support, and so chooses to use atomic operations.
>The application will now fail because not all updates to the memory
>location are protected by the mutex.
>
>The use of a mutex (or not) is part of the API associated with the
>memory location.  If it's ever used by one part of the application,
>it must be used by all.
>  
>
Ok, I think I get this point: you cannot link together two parts of the
application "sharing" a memory location and protecting the accesses in
different ways. Seems clear now. Very similar to the issue you have if
you try to allocate with a memory allocator and deallocate with another
in different parts of the application.

>The only thing we can do for libstdc++ is to put something in the 
>config files that claims that atomic operations are *always*
>available for all members of the architecture family.  If so, we
>may inline them.  Otherwise, we should force the routines to be
>out-of-line in the library.  This provides the stable ABI that we
>require.
>
I think I get this too. This piece of information in the config files is
basically info that we already have easily available: powerpc -> ok;
alpha -> ok; i?86 -> not ok; Sparc -> not ok, and so on (*)... We only
need to use it in the configury of the library and either call the
builtins or not, depending on the architecture family. We have to add to
the library out-of-line versions of the builtins... (in order to do that,
we may end up restoring the old inline assembly implementations of CAS,
for example)

>  We may then use optimization flags to either implement
>the operation via atomic operations, or a mutex.
>  
>
If I understand correctly, this is what we are already doing in the
*.so, for i386 vs i486+. I would not call that "optimization flag",
however. Can you clarify?

Therefore, all in all, if I understand correctly, we:
1- Don't need new preprocessor builtins.
2- Don't need new libgcc code.
3- Do need out-of-line library versions of the various builtins for the
architecture families that don't implement the builtins uniformly (a lot
of work!).
4- Do need a bit of additional configury (bit of work).

Paolo.

(*) I think there is an annoying special case, which is multilib x86_64:
its 32-bit version belongs to the i?86 family and, as far as I can
see, we therefore cannot inline the builtins for x86_64 tout court, can we?


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Richard Henderson
On Mon, Nov 07, 2005 at 01:35:13AM +0100, Paolo Carlini wrote:
> We have to add to the library
> out-of-line versions of the builtins... (in order to do that, we may end
> up restoring the old inline assembly implementations of CAS, for example)

I don't think you need to restore inline assembly.

> If I understand correctly, this is what we are already doing in the
> *.so, for i386 vs i486+. I would not call that "optimization flag",
> however. Can you clarify?

I'm not sure how you were previously controlling what went in here.
By configuration name?  That's certainly one way to do it, and 
probably the most reliable.

Another method is to use -march=i486 on the command line, and from
there use the __i486__ defines already present to determine what
to do.  Note that, at least for x86, -mtune=cpu affects __tune_cpu__,
but not __cpu__.

My thinking would be along the lines of


#if !ARCH_ALWAYS_HAS_SYNC_BUILTINS
/* Assumed detail, not spelled out in the sketch: a library-internal
   mutex for the lock-based fallback path.  */
static __gthread_mutex_t mutex = __GTHREAD_MUTEX_INIT;

_Atomic_word
__exchange_and_add(volatile _Atomic_word* __mem, int __val)
{
#if ARCH_HAS_SYNC_BUILTINS
  return __sync_fetch_and_add(__mem, __val);
#else
  __gthread_mutex_lock (&mutex);
  _Atomic_word ret = *__mem;
  *__mem = ret + __val;
  __gthread_mutex_unlock (&mutex);
  return ret;
#endif
}

void
__atomic_add(volatile _Atomic_word* __mem, int __val)
{
#if ARCH_HAS_SYNC_BUILTINS
  __sync_fetch_and_add(__mem, __val);
#else
  __gthread_mutex_lock (&mutex);
  *__mem += __val;
  __gthread_mutex_unlock (&mutex);
#endif
}
#endif

This definition applies identically to *all* platforms.

For x86, the config file would have

#define ARCH_ALWAYS_HAS_SYNC_BUILTINS 0
#define ARCH_HAS_SYNC_BUILTINS \
  (__i486__ || __i586__ || __i686__ || __k6__ || __athlon__ || \
   __k8__ || __pentium4__ || __nocona__)

We then arrange for the appropriate -march= -mtune= combination
to be set by the configury.  You might want to examine what I'm
doing with this kind of thing for libgomp.


r~


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini
Richard Henderson wrote:

>On Mon, Nov 07, 2005 at 01:35:13AM +0100, Paolo Carlini wrote:
>  
>
>>We have to add to the library
>>out-of-line versions of the builtins... (in order to do that, we may end
>>up restoring the old inline assembly implementations of CAS, for example)
>>
>>
>I don't think you need to restore inline assembly.
>  
>
Would be a dirty short-cut... ;)

>>If I understand correctly, this is what we are already doing in the
>>*.so, for i386 vs i486+. I would not call that "optimization flag",
>>however. Can you clarify?
>>
>>
>I'm not sure how you were previously controling what went in here.
>By configuration name?
>
Yes, see configure.host

>  That's certainly one way to do it, and 
>probably the most reliable.
>  
>
Ok, thanks.

>Another method is to use -march=i486 on the command line, and from
>there use the __i486__ defines already present to determine what
>to do.  Note that, at least for x86, -mtune=cpu affects __tune_cpu__,
>but not __cpu__.
>
>My thinking would be along the lines of
>  
>
[snip]

Ok, thanks. That is also by and large what I had in mind, except that I
would exploit our current infrastructure, which you can see by looking at
configure.host.

To be sure: can you confirm that there is no easy solution for the
x86_64 issue? I mean, it's annoying that we cannot inline the builtins
for i686, but even more so for x86_64...

Paolo.


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini
Richard Henderson wrote:

>My thinking would be along the lines of
>
>
>#if !ARCH_ALWAYS_HAS_SYNC_BUILTINS
>  
>
[snip]

>#endif
>  
>
Well, there is a minor catch: if we don't want to break the ABI, we have
to keep on implementing and exporting from the *.so __exchange_and_add
and __atomic_add, also for arches that actually always support the sync
builtins (even if new code will have the builtins expanded inline).
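
To make that concrete, here is a minimal sketch of what I have in mind
(the file name, the typedef and the exact declarations are only
illustrative, not the actual libstdc++ layout): an out-of-line copy is
always compiled into the *.so, even on arches where the headers inline
the builtin, so the exported symbols never disappear:

/* compatibility-atomicity.cc (illustrative name), always built into the
   *.so.  New code never reaches these definitions, it gets the inline
   builtins from the headers, but binaries linked against older releases
   keep resolving the exported symbols here.  */

typedef int _Atomic_word;   /* assumption: whatever the target really uses */

_Atomic_word
__exchange_and_add(volatile _Atomic_word* __mem, int __val)
{
  /* On an arch that always has the builtins this is just the builtin;
     otherwise it would be the mutex fallback from the sketch above.  */
  return __sync_fetch_and_add(__mem, __val);
}

void
__atomic_add(volatile _Atomic_word* __mem, int __val)
{
  __sync_fetch_and_add(__mem, __val);
}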

Paolo.


Does gcc-3.4.3 for HP-UX 11.23/IA-64 work?

2005-11-06 Thread Albert Chin
I've built gcc-3.4.3 for HP-UX 11.23/IA-64 and used the pre-compiled
gcc-3.4.4 binary from the http://www.hp.com/go/gcc site. Both exhibit
the same problem. While trying to build Perl 5.8.6:
  $ gmake
  ...
  gcc -v -o libperl.so -shared -fPIC perl.o  gv.o toke.o perly.o op.o
pad.o regcomp.o dump.o util.o mg.o reentr.o hv.o av.o run.o pp_hot.o
sv.o pp.o scope.o pp_ctl.o pp_sys.o doop.o doio.o regexec.o utf8.o
taint.o deb.o universal.o xsutils.o globals.o perlio.o perlapi.o
numeric.o locale.o pp_pack.o pp_sort.o  -lcl -lnsl -lnm -ldl -ldld -lm
-lsec -lpthread -lc
  ...
 /opt/TWWfsw/gcc343/libexec/gcc/ia64-hp-hpux11.23/3.4.3/collect2
+Accept TypeMismatch -b -o libperl.so -L/opt/TWWfsw/gcc343r/lib
-L/opt/TWWfsw/gcc343/lib/gcc/ia64-hp-hpux11.23/3.4.3 -L/usr/ccs/bin
-L/usr/ccs/lib
-L/opt/TWWfsw/gcc343/lib/gcc/ia64-hp-hpux11.23/3.4.3/../../.. perl.o
gv.o toke.o perly.o op.o pad.o regcomp.o dump.o util.o mg.o reentr.o
hv.o av.o run.o pp_hot.o sv.o pp.o scope.o pp_ctl.o pp_sys.o doop.o
doio.o regexec.o utf8.o taint.o deb.o universal.o xsutils.o globals.o
perlio.o perlapi.o numeric.o locale.o pp_pack.o pp_sort.o -lcl -lnsl
-lnm -ldl -ldld -lm -lsec -lpthread -lc -lgcc -lgcc
^^^

Notice the "-lgcc -lgcc" at the end of the collect2 command-line,
_not_ "-lgcc_s -lgcc_s".

On HP-UX 11.23/PA-RISC, I get:
  /opt/TWWfsw/gcc343/libexec/gcc/hppa2.0-hp-hpux11.23/3.4.3/collect2
-z -b -o libperl.sl -L/opt/TWWfsw/gcc343r/lib
-L/opt/TWWfsw/gcc343r/lib
-L/opt/TWWfsw/gcc343/lib/gcc/hppa2.0-hp-hpux11.23/3.4.3 -L/usr/ccs/bin
-L/usr/ccs/lib -L/opt/langtools/lib -L/opt/TWWfsw/gcc343/lib perl.o
gv.o toke.o perly.o op.o pad.o regcomp.o dump.o util.o mg.o reentr.o
hv.o av.o run.o pp_hot.o sv.o pp.o scope.o pp_ctl.o pp_sys.o doop.o
doio.o regexec.o utf8.o taint.o deb.o universal.o xsutils.o globals.o
perlio.o perlapi.o numeric.o locale.o pp_pack.o pp_sort.o -lcl -lnsl
-lnm -lmalloc -ldld -lm -lcrypt -lsec -lpthread -lc -lgcc_s -lgcc_s

Using the HP pre-compiled binary of gcc-4.0.2, I get:
 /opt/hp-gcc/4.0.2/bin/../libexec/gcc/ia64-hp-hpux11.23/4.0.2/collect2
-z +Accept TypeMismatch -b -o libperl.so
-L/opt/hp-gcc/4.0.2/bin/../lib/gcc/ia64-hp-hpux11.23/4.0.2
-L/opt/hp-gcc/4.0.2/bin/../lib/gcc
-L/opt/hp-gcc/4.0.2//lib/gcc/ia64-hp-hpux11.23/4.0.2 -L/usr/ccs/bin
-L/usr/ccs/lib
-L/opt/hp-gcc/4.0.2/bin/../lib/gcc/ia64-hp-hpux11.23/4.0.2/../../..
-L/opt/hp-gcc/4.0.2//lib/gcc/ia64-hp-hpux11.23/4.0.2/../../.. perl.o
gv.o toke.o perly.o op.o pad.o regcomp.o dump.o util.o mg.o reentr.o
hv.o av.o run.o pp_hot.o sv.o pp.o scope.o pp_ctl.o pp_sys.o doop.o
doio.o regexec.o utf8.o taint.o deb.o universal.o xsutils.o globals.o
perlio.o perlapi.o numeric.o locale.o pp_pack.o pp_sort.o -lcl -lnsl
-lnm -ldl -ldld -lm -lsec -lpthread -lc -lgcc_s -lunwind -lgcc_s
-lunwind

The "*libgcc" line from the 3.4.3/3.4.4 specs file:
  *libgcc:
  %{shared-libgcc:%{!mlp64:-lgcc_s}%{mlp64:-lgcc_s_hpux64} 
%{static|static-libgcc:-lgcc -lgcc_eh 
-lunwind}%{!static:%{!static-libgcc:%{!shared:%{!shared-libgcc:-lgcc -lgcc_eh 
-lunwind}%{shared-libgcc:-lgcc_s%M -lunwind -lgcc}}%{shared:-lgcc_s%M 
-lunwind%{!shared-libgcc:-lgcc}

The "*libgcc" line from the 4.0.2 specs file (via -dumpspecs):
  *libgcc:
  %{static|static-libgcc:-lgcc -lgcc_eh 
-lunwind}%{!static:%{!static-libgcc:%{!shared:%{!shared-libgcc:-lgcc -lgcc_eh 
-lunwind}%{shared-libgcc:-lgcc_s -lunwind -lgcc}}%{shared:-lgcc_s -lunwind}}}

Is the problem in the "*libgcc" entry? It seems !shared-libgcc is
true, though I don't know why.
  $ /opt/TWWfsw/gcc343/bin/gcc -v
  Reading specs from
  /opt/TWWfsw/gcc343/lib/gcc/ia64-hp-hpux11.23/3.4.3/specs
  Configured with: /opt/build/gcc-3.4.3/configure --with-gnu-as
  --with-as=/opt/TWWfsw/gcc343/ia64-hp-hpux11.23/bin/as
  --with-included-gettext --enable-shared
  --datadir=/opt/TWWfsw/gcc343/share --enable-languages=c,c++,f77
  --with-local-prefix=/opt/TWWfsw/gcc343 --prefix=/opt/TWWfsw/gcc343
  Thread model: single
  gcc version 3.4.3 (TWW)

-- 
albert chin ([EMAIL PROTECTED])


Re: Call for compiler help/advice: atomic builtins for v3

2005-11-06 Thread Paolo Carlini
Richard Henderson wrote:

>Actually, no, it's not possible.  At least in the context we're
>discussing here.  Consider:
>
>One part of the application (say, libstdc++) is compiled with only
>i386 support.  Here we wind up relying on a mutex to protect the
>memory update.  Another part of the application (say, the exe) is
>compiled with i686 support, and so chooses to use atomic operations.
>The application will now fail because not all updates to the memory
>location are protected by the mutex.
>  
>
Richard, sorry, on second thought I don't agree. You are not considering
that the idea is to use a "smart" libgcc, a la glibc, as per Mark's and
Uli's messages.

A "libstdc++ compiled with only i386 support" what is it? It is a
libstdc++ which at run time will call into libgcc, it has nothing
inline. Then libgcc will use the best the machine has available, that
is, certainly atomic operations, if the exe (compiled with -march=i686)
can possibly run.

In short, the keys are:
1- The "smart" libgcc, which always makes available the best the machine.
2- Mutexes cannot be inline, only atomic operations can.
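
To illustrate the first point, a purely hypothetical sketch of such a
"smart" out-of-line helper (the names are made up and the CPU probe is
left abstract; this is not actual libgcc code). The important property is
that, whenever the hardware can run -march=i686 code at all, the
out-of-line path uses the very same atomic instruction that the
i686-compiled parts inline, so mixing the two in one process stays
coherent:

#include <pthread.h>

extern int __probe_cpu_has_xadd(void);   /* hypothetical runtime CPU check */

static int __cpu_has_xadd = -1;           /* -1 = not probed yet */
static pthread_mutex_t __fallback_lock = PTHREAD_MUTEX_INITIALIZER;

int
__runtime_exchange_and_add(volatile int *mem, int val)
{
  int old;

  if (__builtin_expect(__cpu_has_xadd < 0, 0))
    __cpu_has_xadd = __probe_cpu_has_xadd();  /* idempotent, benign race */

  if (__cpu_has_xadd)
    {
      /* Same instruction the inline i486+ code paths would use.  */
      __asm__ __volatile__ ("lock; xaddl %0, %1"
                            : "=r" (old), "+m" (*mem)
                            : "0" (val)
                            : "memory");
      return old;
    }

  /* Genuinely pre-i486 hardware: nothing else in the process can be
     inlining the atomic instructions, so a plain lock is safe here.  */
  pthread_mutex_lock(&__fallback_lock);
  old = *mem;
  *mem = old + val;
  pthread_mutex_unlock(&__fallback_lock);
  return old;
}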

Paolo.


Re: [RFC] Enabling loop unrolls at -O3?

2005-11-06 Thread Gabriel Dos Reis
"Giovanni Bajo" <[EMAIL PROTECTED]> writes:

| Gabriel Dos Reis <[EMAIL PROTECTED]> wrote:
| 
| >>> You must not have been paying attention to one of the most frequent
| >>> complaints about gcc, which is that it is dog slow already ;-)
| >>
| >> Sure, but to me -O2 says you don't care much about compilation time.
| >
| > If the Ada front-end wishes, it can make special flags for its own
| > needs...
| 
| 
| Why are you speaking of the Ada frontend?

http://gcc.gnu.org/ml/gcc/2005-11/msg00262.html

-- Gaby


no warning being displayed.

2005-11-06 Thread Inder
Hi all,
I am compiling a small program with gcc 3.4.3 for SPARC.

test.c
---

struct test1
{
 int a;
 int b;
 char c;
};

struct test2
 {
 char a;
 char b;
 char c;
};

struct test3
{
 int a;
 int b;
 int c;
};

int main()
{
 struct test1* t1, t11;
 struct test2* t2 ;
 struct test3* t3;

 t1 = &t11;
 t2 = (struct t2*)t1;
 t3 = (struct t3*)t1;
 return 0;
}


I suppose such an assignment should give an "incompatible pointer type"
warning, but when compiling with gcc 3.4.3 no such warning is given,
even with -Wall enabled.

Is this a bug in this version?

GCC version used
---
 ./cc1 --version
GNU C version 3.4.3 (sparc-elf)
compiled by GNU C version 3.2.3 20030502 (Red Hat Linux 3.2.3-20).
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072


--
Thanks,
Inder