Re: 32 bit jump instruction.

2006-12-13 Thread Joern Rennecke
In http://gcc.gnu.org/ml/gcc/2006-12/msg00328.html, you wrote:


>> On 06 Dec 2006 23:13:35 -0800, Ian Lance Taylor <[EMAIL PROTECTED]> wrote:
>> If you can't afford to lose a register, then I think your only option
>> is to pick some callee-saved register and have each branch instruction
>> explicitly clobber it.  Then it will be available for use in a long
>> branch, and it will be available for use within a basic block.  This
>> is far from ideal, but I don't know a better way to handle it within
>> gcc's current framework.

> Can I get more clarity on this part? Is it implemented in any other backends?

> When you say "pick some callee-saved register", do you mean picking one
> at random from the available set in CALL_USED_REGISTERS, or a specific
> register?

The SH does register scavenging and sharing of far branches.  Look at
config/sh/sh.c:split_branches.  Also see PR 29336 for how this could be better
integrated with machine-specific constant pool placement.

However, because the SH has delayed branches, there is always a guaranteed way
to find a register: one can be saved, and then restored in the delay slot.
An architecture without delay slots would need another fallback
mechanism, e.g. inserting a register restore before the target (possibly with
a short jump around it), duplicating instructions from the target until a
register dies, or inserting a register restore and jump in the vicinity of the
target.


Re: 32 bit jump instruction.

2006-12-13 Thread Steven Bosscher

On 12/13/06, Joern Rennecke <[EMAIL PROTECTED]> wrote:

> In http://gcc.gnu.org/ml/gcc/2006-12/msg00328.html, you wrote:
> However, because the SH has delayed branches, there is always a guaranteed way
> to find a register - one can be saved, and then be restored in the delay slot.


Heh, that's an interesting feature :-)

How does that work?  I always thought that the semantics of delayed
insns is that the insn in the delay slot is executed *before* the
branch. But that is apparently not the case, or the branch register
would have been over-written before the branch. How does that work on
SH?

Gr.
Steven


Memory allocation for local variables.

2006-12-13 Thread Sandeep Kumar

Hi all,
I compiled the two programs below on x86 32-bit machines, and when I
disassembled them with objdump I saw the code shown below each one.

Can anyone help me understand why %esp is decremented by 0x18 bytes in
the objdump of the first program, but by 0x28 bytes in the second?

How exactly does gcc allocate memory for local variables?

Kindly help.

int main()
{
 char x;
 return 0;

}


[EMAIL PROTECTED] ~]# gcc test.c
[EMAIL PROTECTED] ~]# objdump -S a.out | less
08048348 <main>:
8048348:   55                 push   %ebp
8048349:   89 e5              mov    %esp,%ebp
804834b:   83 ec 18           sub    $0x18,%esp
804834e:   83 e4 f0           and    $0xfffffff0,%esp
8048351:   b8 00 00 00 00     mov    $0x0,%eax
8048356:   83 c0 0f           add    $0xf,%eax
8048359:   83 c0 0f           add    $0xf,%eax
804835c:   c1 e8 04           shr    $0x4,%eax
804835f:   c1 e0 04           shl    $0x4,%eax
8048362:   29 c4              sub    %eax,%esp
8048364:   b8 00 00 00 00     mov    $0x0,%eax
8048369:   c9                 leave
804836a:   c3                 ret
804836b:   90                 nop

int main()
{
 double x,y,z;
 char p,q,r;
 return 0;

}

08048348 <main>:
8048348:   55                 push   %ebp
8048349:   89 e5              mov    %esp,%ebp
804834b:   83 ec 28           sub    $0x28,%esp
804834e:   83 e4 f0           and    $0xfffffff0,%esp
8048351:   b8 00 00 00 00     mov    $0x0,%eax
8048356:   83 c0 0f           add    $0xf,%eax
8048359:   83 c0 0f           add    $0xf,%eax
804835c:   c1 e8 04           shr    $0x4,%eax
804835f:   c1 e0 04           shl    $0x4,%eax
8048362:   29 c4              sub    %eax,%esp
8048364:   b8 00 00 00 00     mov    $0x0,%eax
8048369:   c9                 leave
804836a:   c3                 ret
804836b:   90                 nop
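Side by side, the two listings differ only in the immediate of `sub $0x..,%esp`; the rest of the prologue is identical. The sketch below replays those instructions on a plain 32-bit integer so the arithmetic is visible. It is a toy model under stated assumptions: the helper name `prologue_esp` and the entry value of %esp used below are hypothetical, and the sketch only shows *what* the instructions compute, not why gcc chose 0x18 or 0x28 for these particular locals.

```cpp
#include <cstdint>

// Toy replay of the unoptimized main() prologue from the objdump listings.
// frame_bytes is the immediate of `sub $0x..,%esp` (0x18 or 0x28 above).
// Hypothetical helper for illustration only; models 32-bit register math.
std::uint32_t prologue_esp(std::uint32_t esp_on_entry, std::uint32_t frame_bytes)
{
    std::uint32_t esp = esp_on_entry;
    esp -= 4;               // push %ebp (mov %esp,%ebp then copies it to %ebp)
    esp -= frame_bytes;     // sub $frame_bytes,%esp: fixed area for locals
    esp &= 0xfffffff0u;     // and $0xfffffff0,%esp: round down to 16 bytes
    std::uint32_t eax = 0;  // mov $0x0,%eax
    eax += 0xf;             // add $0xf,%eax  -> 15
    eax += 0xf;             // add $0xf,%eax  -> 30
    eax >>= 4;              // shr $0x4,%eax  -> 1
    eax <<= 4;              // shl $0x4,%eax  -> 16
    esp -= eax;             // sub %eax,%esp: 16 more bytes, alignment kept
    return esp;
}
```

With a hypothetical entry value of 0xbffff8cc, `prologue_esp(0xbffff8cc, 0x18)` yields 0xbffff8a0 and `prologue_esp(0xbffff8cc, 0x28)` yields 0xbffff890; in both cases the final %esp is 16-byte aligned, which is what the `and`/`shr`/`shl` steps are there to guarantee.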




--
Regards,
Sandeep






If the facts don't fit the theory, change the facts.


g++ doesn't unroll a loop it should unroll

2006-12-13 Thread Benoît Jacob
Hi,

I'm developing a Free C++ template library (1) in which it is very important 
that certain loops get unrolled, but at the same time I can't unroll them by 
hand, because they depend on template parameters.

My problem is that G++ 4.1.1 (Gentoo) doesn't unroll these loops.

I have written a standalone simple program showing this problem; I attach it 
(toto.cpp) and I also paste it below. This program does a loop if UNROLL is 
not defined, and does the same thing but with the loop unrolled by hand if 
UNROLL is defined. So one would expect that with g++ -O3, the speed would be 
the same in both cases. Alas, it's not:

g++ -DUNROLL -O3 toto.cpp -o toto   ---> toto runs in 0.3 seconds
g++ -O3 toto.cpp -o toto            ---> toto runs in 1.9 seconds

So what can I do? Is that a bug in g++? If yes, any hope to see it fixed soon?

Cheers,
Benoit

(1) : Eigen, see http://eigen.tuxfamily.org


file: toto.cpp


#include <iostream>

class Matrix
{
public:
    double data[9];
    double & operator()( int i, int j )
    {
        return data[i + 3 * j];
    }
    void loadScaling( double factor );
};

void Matrix::loadScaling( double factor )
{
#ifdef UNROLL
    (*this)( 0, 0 ) = factor;
    (*this)( 1, 0 ) = 0;
    (*this)( 2, 0 ) = 0;
    (*this)( 0, 1 ) = 0;
    (*this)( 1, 1 ) = factor;
    (*this)( 2, 1 ) = 0;
    (*this)( 0, 2 ) = 0;
    (*this)( 1, 2 ) = 0;
    (*this)( 2, 2 ) = factor;
#else
    for( int i = 0; i < 3; i++ )
        for( int j = 0; j < 3; j++ )
            (*this)(i, j) = (i == j) * factor;
#endif
}

int main( int argc, char *argv[] )
{
    Matrix m;
    for( int i = 0; i < 1; i++ )
        m.loadScaling( i );
    std::cout << "m(0,0) = " << m(0,0) << std::endl;
}




Re: 32 bit jump instruction.

2006-12-13 Thread amylaar
Quoting Steven Bosscher <[EMAIL PROTECTED]>:

> On 12/13/06, Joern Rennecke <[EMAIL PROTECTED]> wrote:
> > In http://gcc.gnu.org/ml/gcc/2006-12/msg00328.html, you wrote:
> > However, because the SH has delayed branches, there is always a guaranteed
> way
> > to find a register - one can be saved, and then be restored in the delay
> slot.
>
> Heh, that's an interesting feature :-)
>
> How does that work?  I always thought that the semantics of delayed
> insns is that the insn in the delay slot is executed *before* the
> branch. But that is apparently not the case, or the branch register
> would have been over-written before the branch. How does that work on
> SH?

The jump address is calculated, then the delay slot instruction is
executed (or, if the instructions are pairable, the delay slot insn is
sometimes executed simultaneously with the jump address calculation),
then, or during the delay slot insn execution, the target instruction
is fetched and then executed.  You can look into sim/sh/interp.c for
a functional model of how this works from the programmer's point of view.
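The ordering described above is what makes the save/restore trick safe: the target is read from the register before the delay-slot instruction overwrites it. A minimal sketch as a hypothetical toy model (the names `ToyCpu`, `save_and_load_target` and `jump_with_restore_in_delay_slot` are made up; it does not model the real SH pipeline, for which sim/sh/interp.c remains the reference):

```cpp
#include <cstdint>
#include <vector>

// Toy model of the ordering only, not of real SH hardware: the jump target
// is read from the register *before* the delay-slot instruction runs, so the
// delay slot may overwrite that register without changing where we land.
struct ToyCpu {
    std::uint32_t regs[16] = {};
    std::uint32_t pc = 0;
    std::vector<std::uint32_t> stack;

    // Borrow register r for a far branch: save its old value on the stack,
    // then load the far target address into it.
    void save_and_load_target(int r, std::uint32_t target) {
        stack.push_back(regs[r]);
        regs[r] = target;
    }

    // Register-indirect jump with a register restore in its delay slot.
    void jump_with_restore_in_delay_slot(int r) {
        std::uint32_t target = regs[r];  // 1. jump address is calculated
        regs[r] = stack.back();          // 2. delay-slot insn restores r
        stack.pop_back();
        pc = target;                     // 3. execution continues at target
    }
};
```

After `save_and_load_target(1, 0x8000)` followed by `jump_with_restore_in_delay_slot(1)`, the model's `pc` is 0x8000 and register 1 holds its original value again, so no register had to be permanently reserved for the long branch.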


Re: g++ doesn't unroll a loop it should unroll

2006-12-13 Thread Steven Bosscher

On 12/13/06, Benoît Jacob <[EMAIL PROTECTED]> wrote:

> g++ -DUNROLL -O3 toto.cpp -o toto   ---> toto runs in 0.3 seconds
> g++ -O3 toto.cpp -o toto            ---> toto runs in 1.9 seconds
>
> So what can I do? Is that a bug in g++? If yes, any hope to see it fixed soon?


You could try adding -funroll-loops.

Gr.
Steven


Re: Memory allocation for local variables.

2006-12-13 Thread Steven Bosscher

On 12/13/06, Sandeep Kumar <[EMAIL PROTECTED]> wrote:

> Hi all,
> I tried compiling the above two programs :
> on x86, 32 bit machines.
> [EMAIL PROTECTED] ~]# gcc test.c


Try with optimization enabled (try -O1 and/or -O2).

Gr.
Steven


Re: g++ doesn't unroll a loop it should unroll

2006-12-13 Thread Benoît Jacob
I had already tried that. That doesn't change anything.

I had also tried passing a higher --param max-unroll-times. No effect.

So, any idea? The example program toto.cpp is so simple, I can't believe g++ 
can't handle it. Surely there must be something simple that I haven't 
understood?

Benoit

Le mercredi 13 décembre 2006 13:12, Steven Bosscher a écrit :
> On 12/13/06, Benoît Jacob <[EMAIL PROTECTED]> wrote:
> > g++ -DUNROLL -O3 toto.cpp -o toto   ---> toto runs in 0.3 seconds
> > g++ -O3 toto.cpp -o toto            ---> toto runs in 1.9 seconds
> >
> > So what can I do? Is that a bug in g++? If yes, any hope to see it fixed
> > soon?
>
> You could try adding -funroll-loops.
>
> Gr.
> Steven




Re: libffi compilation failure on Solaris 10?

2006-12-13 Thread Stewart Stremler
begin  quoting Eric Botcazou as of Thu, Nov 30, 2006 at 10:05:21PM +0100:
[snip]
> ... if the user is trying to link objects files assembled by the GNU assembler
> using the Sun linker.

...which seems to be the case. Even when configure is told to use gld,
somewhere down the line, gcc eventually decides that it should use
/usr/ccs/bin/ld instead.

http://www.stremler.net/temp/GST/try3.txt

I see that 2.3.1 is out, but I'm wondering if I need to start with
getting a later version of GCC and friends.

-- 
Stewart Stremler


Re: g++ doesn't unroll a loop it should unroll

2006-12-13 Thread Ian Lance Taylor
Benoît Jacob <[EMAIL PROTECTED]> writes:

> I'm developing a Free C++ template library (1) in which it is very important 
> that certain loops get unrolled, but at the same time I can't unroll them by 
> hand, because they depend on template parameters.
> 
> My problem is that G++ 4.1.1 (Gentoo) doesn't unroll these loops.
> 
> I have written a standalone simple program showing this problem; I attach it 
> (toto.cpp) and I also paste it below. This program does a loop if UNROLL is 
> not defined, and does the same thing but with the loop unrolled by hand if 
> UNROLL is defined. So one would expect that with g++ -O3, the speed would be 
> the same in both cases. Alas, it's not:

When I try it, gcc does unroll the loops.  It completely unrolls the
inner loop, but only partially unrolls the outer loop.  The reason it
doesn't completely unroll the outer loop is simply that gcc doesn't
attempt to completely unroll loops which contain inner loops.

This could probably be fixed: we could probably completely unroll a
loop if all its inner loops were completely unrolled.  I encourage you
to file a bug report.  See http://gcc.gnu.org/bugs.html.
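Until the unroller handles nested loops, one workaround sometimes used in template libraries is to unroll loops whose trip count is a template parameter with template recursion, so the compiler sees straight-line code like the UNROLL variant of toto.cpp. This is a swapped-in technique, not the fix Ian proposes and not Eigen's code; `ScalingUnroller` and `loadScalingUnrolled` are hypothetical names. It fills an N x N column-major array the same way loadScaling() does.

```cpp
// Hypothetical compile-time unroller (template recursion), not Eigen's code.
// Writes element Remaining-1 of an N x N column-major matrix, then recurses.
template <int N, int Remaining>
struct ScalingUnroller {
    static void run(double* data, double factor) {
        const int k = Remaining - 1;  // linear index into data
        const int i = k % N;          // row
        const int j = k / N;          // column
        data[i + N * j] = (i == j) ? factor : 0.0;
        ScalingUnroller<N, Remaining - 1>::run(data, factor);
    }
};

// Recursion ends once every element has been written.
template <int N>
struct ScalingUnroller<N, 0> {
    static void run(double*, double) {}
};

// Same effect as Matrix::loadScaling() for N == 3, with no runtime loop.
template <int N>
void loadScalingUnrolled(double* data, double factor) {
    ScalingUnroller<N, N * N>::run(data, factor);
}
```

Calling `loadScalingUnrolled<3>(m.data, factor)` writes the same nine values as the hand-unrolled version. The recursion depth grows with N * N, so this only suits small, fixed sizes.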

Ian


Re: g++ doesn't unroll a loop it should unroll

2006-12-13 Thread Benoit Jacob

On Wed, 13 Dec 2006, Ian Lance Taylor wrote:

> When I try it, gcc does unroll the loops.  It completely unrolls the
> inner loop, but only partially unrolls the outer loop.  The reason it
> doesn't completely unroll the outer loop is simply that gcc doesn't
> attempt to completely unroll loops which contain inner loops.


OK, I don't have the skill to check the binary code. All I know is that
with UNROLL defined it runs more than 5x faster, so there is room for
improvement :)


> This could probably be fixed: we could probably completely unroll a
> loop if all its inner loop were completely unrolled.  I encourage you
> to file a bug report.  See http://gcc.gnu.org/bugs.html.


OK: I didn't dare decide on my own that it's a bug, but if you say so... I'll
file a bug report now.


Benoit


Re: Serious SPEC CPU 2006 FP performance regressions on IA32

2006-12-13 Thread Grigory Zagorodnev

Meissner, Michael wrote:

> 437.leslie3d    -26%
>
> it was felt that the PPRE patches that were added on November 13th were
> the cause of the slowdown:
> http://gcc.gnu.org/ml/gcc/2006-12/msg00023.html
>
> Has anybody tried doing a run with just ppre disabled?

Right. PPRE appears to be the reason for the slowdown.

-fno-tree-pre gets the performance of cpu2006/437.leslie3d back to normal.
This is the worst case. Verifying the whole set of cpu2006 benchmarks
will take much longer.


- Grigory


Re: Serious SPEC CPU 2006 FP performance regressions on IA32

2006-12-13 Thread Richard Guenther

On 12/13/06, Grigory Zagorodnev <[EMAIL PROTECTED]> wrote:

> Meissner, Michael wrote:
> >>> 437.leslie3d    -26%
> > it was felt that the PPRE patches that were added on November 13th were
> > the cause of the slowdown:
> > http://gcc.gnu.org/ml/gcc/2006-12/msg00023.html
> >
> > Has anybody tried doing a run with just ppre disabled?
>
> Right. PPRE appears to be the reason of slowdown.
>
> -fno-tree-pre gets performance of cpu2006/437.leslie3d back to normal.
> This is the worst case. And that will take much longer to verify whole
> set of cpu2006 benchmarks.


It would be sooo nice to have a (small) testcase that shows why we are
regressing.

Thanks!
Richard.


RE: Serious SPEC CPU 2006 FP performance regressions on IA32

2006-12-13 Thread Menezes, Evandro
> Meissner, Michael wrote:
> >>> 437.leslie3d  -26%
> > it was felt that the PPRE patches that were added on November 13th were
> > the cause of the slowdown:
> > http://gcc.gnu.org/ml/gcc/2006-12/msg00023.html
> >
> > Has anybody tried doing a run with just ppre disabled?
>
> Right. PPRE appears to be the reason of slowdown.
>
> -fno-tree-pre gets performance of cpu2006/437.leslie3d back to normal.
> This is the worst case. And that will take much longer to verify whole
> set of cpu2006 benchmarks.

If -fno-tree-pre disables PPRE, then it doesn't change much (4.3 relative to 
4.2 with -O2):

CPU2006          -O2     -O2 -fno-tree-pre
410.bwaves       -6%     -8%
416.gamess
433.milc
434.zeusmp
435.gromacs
436.cactusADM
437.leslie3d     -26%    -27%
444.namd
447.dealII
450.soplex
453.povray
454.calculix
459.GemsFDTD     -12%    -12%
465.tonto
470.lbm
481.wrf
482.sphinx3

Is PPRE enabled at -O2 at all?  I couldn't confirm that from the original 
patches, which enabled PPRE only at -O3.

-- 
___
Evandro Menezes   AMD   Austin, TX





Re: Serious SPEC CPU 2006 FP performance regressions on IA32

2006-12-13 Thread Richard Guenther

On 12/13/06, Menezes, Evandro <[EMAIL PROTECTED]> wrote:

> Is PPRE enabled at -O2 at all?  I couldn't confirm that from the original
> patches, which enabled PPRE only at -O3.


PPRE is only enabled at -O3.

Richard.


Re: libffi compilation failure on Solaris 10?

2006-12-13 Thread Eric Botcazou
> ...which seems to be the case. Even when configure is told to use gld,
> somewhere down the line, gcc eventually decides that it should use
> /usr/ccs/bin/ld instead.

You need to rebuild the whole compiler if you want to switch from the Sun
tools to GNU binutils, i.e. --with-gnu-ld should be passed to the configure
line of GCC itself and the compiler entirely rebuilt.

-- 
Eric Botcazou


Re: Unwinding CFI gcc practice of assumed `same value' regs

2006-12-13 Thread Michael Matz
Hi,

On Tue, 12 Dec 2006, Andrew Haley wrote:

>  > > In practice, %ebp either points to a call frame -- not necessarily 
>  > > the most recent one -- or is null.  I don't think that having an 
>  > > optional frame pointer means you can use %ebp for anything random at 
>  > > all, but we need to make a clarification request of the ABI.
>  > 
>  > I don't see that as feasible.  If %ebp/%rbp may be used as a general 
>  > callee-saved register, then it can hold any value.
> 
> Sure, we already know that, as has been clear.  The question is *if* 
> %rbp may be used as a general callee-saved register that can hold any 
> value.

Yes, of course it was meant to be used as such.  The ABI actually only gives a 
recommendation that %rbp should be zero in the outermost frame, it's not a 
must.  The ABI _requires_ proper .eh_frame descriptors when unwinding is 
desired; so it's useless (and wrong) for any unwinder to look at %rbp and 
determine if it should stop.

Alternatively (though not sanctioned by the ABI) all functions through 
which unwinding is desired but for which no unwind info is created _have_ 
to use %rbp as frame pointer and not as general register.  In that case 
the zeroing of %rbp would be a usable stop condition for functions without 
unwind info.  But that's already outside the ABI.


Ciao,
Michael.


Re: Unwinding CFI gcc practice of assumed `same value' regs

2006-12-13 Thread Michael Matz
Hi,

On Mon, 11 Dec 2006, Jan Kratochvil wrote:

> currently (on x86_64) the gdb backtrace does not properly stop at the
> outermost frame:
> 
> #3  0x0036ddb0610a in start_thread () from /lib64/tls/libpthread.so.0
> #4  0x0036dd0c68c3 in clone () from /lib64/tls/libc.so.6
> #5  0x in ?? ()
> 
> Currently it relies only on clearing %rbp (0x above is
> unrelated to it, it got read from uninitialized memory).
> 
> http://sourceware.org/ml/gdb/2004-08/msg00060.html suggests frame 
> pointer 0x0 should be enough for a debugger not finding CFI to stop 
> unwinding, still it is a heuristic.  In -fomit-frame-pointer compiled 
> code there is no indication the frame pointer register became a regular 
> one and 0x0 is its valid value.

Right.  Unwinding through functions (without frame pointer) requires CFI.  
If there is CFI for a function the unwinder must not look at %rbp for stop 
condition.  If there's no CFI for a function it can't be unwound (strictly 
per ABI).  If one relaxes that and wants to unwind through CFI-less 
functions it has to have a frame pointer.  In that case zero in that frame 
pointer could indicate the outermost frame (_if_ the suggestion in the ABI 
is adhered to, which no one is required to).
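The relaxed case Matz describes can be sketched as a toy frame-pointer walker. It is only valid under his stated assumption that every CFI-less frame keeps a frame pointer; a CFI-based unwinder would consult .eh_frame instead. The zero-return-address check mirrors the convention Jan's patch below adopts. `FrameRecord` and `walk_frame_chain` are hypothetical names, not gdb or libgcc code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy frame record for code that keeps a frame pointer: the saved caller
// frame pointer (%rbp) followed by the return address.
struct FrameRecord {
    const FrameRecord* saved_fp;    // caller's frame pointer
    std::uintptr_t return_address;  // where the caller resumes
};

// Walks the frame-pointer chain collecting return addresses. Stops at a
// zero frame pointer (the ABI's optional convention) or a zero return
// address. Meaningless if any frame uses %rbp as a general register.
std::vector<std::uintptr_t> walk_frame_chain(const FrameRecord* fp,
                                             std::size_t max_frames = 64)
{
    std::vector<std::uintptr_t> pcs;
    while (fp != nullptr && pcs.size() < max_frames) {
        if (fp->return_address == 0)
            break;                  // explicit "outermost frame" marker
        pcs.push_back(fp->return_address);
        fp = fp->saved_fp;          // a null saved_fp ends the loop
    }
    return pcs;
}
```

The `max_frames` cap guards against cycles from garbage frame pointers, which is exactly the failure mode that makes this a heuristic rather than something the ABI guarantees.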


Ciao,
Michael.


Re: Unwinding CFI gcc practice of assumed `same value' regs

2006-12-13 Thread Michael Matz
Hi,

On Tue, 12 Dec 2006, Ulrich Drepper wrote:

> > Really?  Well, that's one interpretation.  I don't believe that, 
> > though.  It's certainly an inconsistency in the specification, which 
> > says that null-termination is supported, and this implies that you 
> > can't put a zero in there.
> 
> Again, this is just because the "authors" of the ABI didn't think.

[Blaeh, Ulrich talk] No, I think it's because the "readers" of the ABI 
can't read.


Ciao,
Michael.


why no boehm-gc tests?

2006-12-13 Thread Jack Howarth
I noticed that boehm-gc check doesn't work from within
the dejagnu framework. According to the notes in PR11412,
this was going to be fixable once the multi-lib stuff
was moved to the top level. I assume this has happened by
now so can we fix this for gcc 4.2?
 Jack



Re: why no boehm-gc tests?

2006-12-13 Thread David Daney

Jack Howarth wrote:

I noticed that boehm-gc check doesn't work from within
the dejagnu framework. According to the notes in PR11412,
this was going to be fixable once the multi-lib stuff
was moved to the top level. I assume this has happened by
now so can we fix this for gcc 4.2?
 Jack

You should probably target the trunk first.  Then after the patch is 
proven there a backport could be considered under the branch commit 
criteria.


David Daney


[PATCH] Re: Unwinding CFI gcc practice of assumed `same value' regs

2006-12-13 Thread Jan Kratochvil
Hi,

On Tue, 12 Dec 2006 16:52:33 +0100, Jakub Jelinek wrote:
...
> Here is something that would handle by default same_value retaddr_column:
[ http://sources.redhat.com/ml/gdb/2006-12/msg00100.html ]

Thanks for this backward compatible glibc unwinder patch.  I would like to have
it accepted as a step toward preparing the environment to use
`.cfi_undefined PC' at some point in the future.

Attaching a patch for current glibc CVS which removes the requirement that
unwinders handle `.cfi_undefined PC' and instead provides an explicit return
address of 0 from the `__clone' function.  A 0 is already present there
today, but it is an uninitialized value read from some TLS or
`struct pthread' area (I did not check which).

Attaching a patch for current gdb CVS to properly terminate on return
address 0.  The check was already present there, but it was applied one
frame too late.

I hope these three patches are 100% reliable and also backward compatible.


Regards,
Jan
2006-12-13  Jan Kratochvil  <[EMAIL PROTECTED]>

* sysdeps/unix/sysv/linux/i386/clone.S: CFI `clone' unwinding
outermost frame indicator replaced by more unwinders compatible
termination indication of `PC == 0'.
* sysdeps/unix/sysv/linux/x86_64/clone.S: Likewise.

--- libc/sysdeps/unix/sysv/linux/i386/clone.S   3 Dec 2006 23:12:36 -   1.27
+++ libc/sysdeps/unix/sysv/linux/i386/clone.S   13 Dec 2006 11:20:55 -
@@ -68,6 +68,8 @@ ENTRY (BP_SYM (__clone))
       thread is started with an alignment of (mod 16).  */
    andl    $0xfffffff0, %ecx
    subl    $28,%ecx
+   /* Terminate the stack frame by pretended return address 0.  */
+   movl    $0,16(%ecx)
    movl    ARG(%esp),%eax  /* no negative argument counts */
    movl    %eax,12(%ecx)
 
@@ -121,10 +123,15 @@ L(pseudo_end):
 
 L(thread_start):
    cfi_startproc;
-   /* Clearing frame pointer is insufficient, use CFI.  */
-   cfi_undefined (eip);
-   /* Note: %esi is zero.  */
-   movl    %esi,%ebp   /* terminate the stack frame */
+   /* This CFI recommended way of unwindable function is incompatible
+      across unwinders incl. the libgcc_s one.
+      cfi_undefined (eip);
+      */
+   /* Frame pointer 0 was considered as the stack frame termination
+      before but it is no longer valid for -fomit-frame-pointer code.
+      Still keep the backward compatibility and clear the register.
+      Note: %esi is zero.  */
+   movl    %esi,%ebp
 #ifdef RESET_PID
    testl   $CLONE_THREAD, %edi
    je      L(newpid)
--- libc/sysdeps/unix/sysv/linux/x86_64/clone.S 3 Dec 2006 23:12:36 -   1.7
+++ libc/sysdeps/unix/sysv/linux/x86_64/clone.S 13 Dec 2006 11:20:55 -
@@ -61,8 +61,12 @@ ENTRY (BP_SYM (__clone))
    testq   %rsi,%rsi   /* no NULL stack pointers */
    jz      SYSCALL_ERROR_LABEL
 
+   /* Prepare the data located at %rsp after `syscall' below.
+      Used only 3*8 bytes but the stack is 16 bytes aligned.  */
+   subq    $32,%rsi
+   /* Terminate the stack frame by pretended return address 0.  */
+   movq    $0,16(%rsi)
    /* Insert the argument onto the new stack.  */
-   subq    $16,%rsi
    movq    %rcx,8(%rsi)
 
    /* Save the function pointer.  It will be popped off in the
@@ -90,10 +94,15 @@ L(pseudo_end):
 
 L(thread_start):
    cfi_startproc;
-   /* Clearing frame pointer is insufficient, use CFI.  */
-   cfi_undefined (rip);
-   /* Clear the frame pointer.  The ABI suggests this be done, to mark
-      the outermost frame obviously.  */
+   /* This CFI recommended way of unwindable function is incompatible
+      across unwinders incl. the libgcc_s one.
+      cfi_undefined (rip);
+      */
+   /* Frame pointer 0 was considered as the stack frame termination
+      before but it is no longer valid for -fomit-frame-pointer code.
+      Still keep the backward compatibility and clear the register,
+      the ABI suggests this be done, to mark the outermost frame
+      obviously.  */
    xorl    %ebp, %ebp
 
 #ifdef RESET_PID
2006-12-13  Jan Kratochvil  <[EMAIL PROTECTED]>

* gdb/frame.c (get_prev_frame): Already the first `PC == 0' stack frame
is declared invalid, not the second one as before.

2006-12-13  Jan Kratochvil  <[EMAIL PROTECTED]>

* gdb.threads/bt-clone-stop.exp, gdb.threads/bt-clone-stop.c:
Backtraced `clone' must not have `PC == 0' as its previous frame.


--- ./gdb/frame.c   10 Nov 2006 20:11:35 -  1.215
+++ ./gdb/frame.c   13 Dec 2006 19:06:19 -
@@ -1390,19 +1390,28 @@ get_prev_frame (struct frame_info *this_
   return NULL;
 }
 
+  prev_frame = get_prev_frame_1 (this_frame);
+  if (!prev_frame)
+    return NULL;
+
   /* Assume that the only way to get a zero PC is through something
  like a SIGSEGV or a dummy frame, and hence that NORMAL frames
- will never unwi

Back End Responsibilities + RTL Generation

2006-12-13 Thread Frank Riese
Hi!

I need some clarification concerning the responsibilities of Middle End, Back 
End, the generation of the control flow graph (CFG) and RTL. I looked at 
several articles about the internal structure of the GCC and looked at the 
internals documentation. However, I would like to have a second opinion about 
this matter.

One of my professors stated that a GCC Back End uses the Control Flow Graph as 
its input and that generation of RTL expressions occurs later on. What roles 
do Back and Middle End play in generation of RTL? Would you consider the CFG 
or RTL expressions as the input for a GCC Back End?

I also remembered having read the following line from the gcc internals 
documentation. However, I'm still not sure how to interpret this:

"A control flow graph (CFG) is a data structure built on top of the 
intermediate code representation (the RTL or tree instruction stream) 
abstracting the control flow behavior of a function that is being compiled"

Does that mean that a control flow graph is built after RTL has been generated,
or that information about the control flow is incorporated into the RTL data
structures? Could somebody clarify this, please?

Cheers,
Frank


Re: g++ doesn't unroll a loop it should unroll

2006-12-13 Thread Denis Vlasenko
Disclaimer: I am not a compiler developer.

On Wednesday 13 December 2006 12:44, Benoît Jacob wrote:
> I'm developing a Free C++ template library (1) in which it is very important 
> that certain loops get unrolled, but at the same time I can't unroll them by 
> hand, because they depend on template parameters.
> 
> My problem is that G++ 4.1.1 (Gentoo) doesn't unroll these loops.
> 
> I have written a standalone simple program showing this problem; I attach it 
> (toto.cpp) and I also paste it below. This program does a loop if UNROLL is 
> not defined, and does the same thing but with the loop unrolled by hand if 
> UNROLL is defined. So one would expect that with g++ -O3, the speed would be 
> the same in both cases. Alas, it's not:
> 
> g++ -DUNROLL -O3 toto.cpp -o toto   ---> toto runs in 0.3 seconds
> g++ -O3 toto.cpp -o toto            ---> toto runs in 1.9 seconds
> 
> So what can I do? Is that a bug in g++?

C++ doesn't specify that the compiler shall unroll loops, so it cannot be
classified as a "real" bug.

# g++ -c -O3 toto.cpp -o toto.o
# g++ -DUNROLL -O3 toto.cpp -o toto_unroll.o -c
# size toto.o toto_unroll.o
   text    data     bss     dec     hex filename
    525       8       1     534     216 toto.o
    359       8       1     368     170 toto_unroll.o

How can a C++ compiler know that you are willing to trade
so much text size for performance?

I usually find myself on opposite side: I use -Os but gcc
still eats more space in the name of speed in certain
situations.

Re code: I would use memset + just a single, non-nested for()
loop anyway... you C++ people tend to overtax the compiler with
optimizations. Is it really necessary to do (i == j) * factor
when (i == j) ? factor : 0 is easier for the compiler to grok?
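Denis's suggested restructuring (one flat fill followed by a single diagonal loop) might look like the sketch below. It uses std::fill with T(0) instead of memset so it does not assume anything about the bit pattern of zero for a generic element type, which is the objection Benoit raises in his reply. `loadScalingFlat` is a hypothetical name, column-major storage is assumed as in toto.cpp, and whether it beats the nested loop would have to be measured.

```cpp
#include <algorithm>

// Hypothetical flat variant of Matrix::loadScaling() for an N x N
// column-major matrix whose element type is T.
template <int N, typename T>
void loadScalingFlat(T* data, T factor)
{
    std::fill(data, data + N * N, T(0));  // zero every element via T(0), no memset
    for (int k = 0; k < N; ++k)
        data[k + N * k] = factor;         // diagonal element (k, k)
}
```

For toto.cpp this would be called as `loadScalingFlat<3>(m.data, factor)`; N is given explicitly and T is deduced from the arguments.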

> If yes, any hope to see it fixed soon? 
> 
> Cheers,
> Benoit
> 
> (1) : Eigen, see http://eigen.tuxfamily.org

"Eigen is a lightweight C++ template library for vector and matrix
math, a.k.a. linear algebra."

A template lib for vector and matrix math sounds like a performance
disaster in the making, at least to me. However, maybe you are a
truly smart guy and can do miracles.

Cheers,
--
vda


Eric Christopher appointed Darwin maintainer

2006-12-13 Thread David Edelsohn
I am pleased to announce that the GCC Steering Committee has
appointed Eric Christopher as Darwin co-maintainer.

Please join me in congratulating Eric on his new role.  Eric,
please update your listings in the MAINTAINERS file.

Happy hacking!
David



Re: Back End Responsibilities + RTL Generation

2006-12-13 Thread Steven Bosscher

On 12/13/06, Frank Riese <[EMAIL PROTECTED]> wrote:

> One of my professors stated that a GCC Back End uses the Control Flow Graph as
> its input and that generation of RTL expressions occurs later on.


That is not true.


> What roles
> do Back and Middle End play in generation of RTL? Would you consider the CFG
> or RTL expressions as the input for a GCC Back End?


Let me first say that the definitions of front end, back end, and
middle end are a bit hairy.  You have to carefully define what you
classify as belonging to the middle end or the back end.  I actually
try to avoid the terms nowadays.

Also, you have to be specific about the version of GCC that you're
talking about. GCC2, GCC3 and GCC4 are completely different
internally, and even the differences between various GCC4 releases are
quite significant.

Anyway...

The steps through the compiler are as follows:

1. front end runs, produces GENERIC
2. GENERIC is lowered to GIMPLE
3. a CFG is constructed for GIMPLE
4. GIMPLE (tree-ssa) optimizers run
5. GIMPLE is expanded to RTL, while preserving the CFG
6. RTL optimizers run
7. assembly is written out

The RTL generation in step 5 is done one statement at a time.   The
part of the compiler that generates the RTL is a mix of shared code
and of back end code: A single GIMPLE statement at a time is passed to
the middle-end expand routines, which try to produce RTL for this
statement using instructions available on the target machine.  The
available instructions are defined by the target machine description
(i.e. the back end).

Try to understand cfgexpand.c and the section on named RTL patterns in
the GCC internals manual.


> I also remembered having read the following line from the gcc internals
> documentation. However, I'm still not sure how to interpret this:
>
> "A control flow graph (CFG) is a data structure built on top of the
> intermediate code representation (the RTL or tree instruction stream)
> abstracting the control flow behavior of a function that is being compiled"
>
> Does that mean that a control flow graph is built after rtl has been generated
> or that information about the control flow is
> incorporated into the RTL data structures?


Neither.

I'm assuming you're interested in how this works in recent GCC
releases, i.e. GCC4 based. In GCC4, the control flow graph is built on
GIMPLE, because the tree-ssa optimizers need a CFG too.  This CFG is kept
up-to-date through the optimizers and through expansion to RTL.  This
means that GCC builds the CFG only once for each function.

The data structures for the CFG are in basic-block.h.  These data
structures are most definitely *not* incorporated into the RTL
structures.  The CFG is independent of the intermediate
representations for the function instructions.  It has to be; otherwise
you could not use the same CFG data structures for both GIMPLE and RTL.
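The separation described here, a CFG that lives beside the instruction stream and survives expansion, can be sketched as a toy model. All names (`Block`, `Cfg`, `expand_stmt`, `expand_to_rtl`) are hypothetical; GCC's real structures are in basic-block.h and the expander in cfgexpand.c. Statements are plain strings, each block's statements are replaced one at a time, and the successor edges are never touched.

```cpp
#include <string>
#include <vector>

// Toy CFG (hypothetical, not GCC's basic-block.h): edges are stored apart
// from the instruction list, so the same graph can carry GIMPLE-like
// statements before expansion and RTL-like insns afterwards.
struct Block {
    std::vector<std::string> insns;  // instructions in the current IR
    std::vector<int> succs;          // indices of successor blocks
};

using Cfg = std::vector<Block>;

// Hypothetical stand-in for expanding one statement via the target's
// named patterns; here it simply wraps the statement text.
std::vector<std::string> expand_stmt(const std::string& stmt) {
    return {"(insn " + stmt + ")"};
}

// Expands each block one statement at a time. The succs vectors are never
// modified, so the CFG built for GIMPLE survives expansion unchanged.
void expand_to_rtl(Cfg& cfg) {
    for (Block& bb : cfg) {
        std::vector<std::string> rtl;
        for (const std::string& stmt : bb.insns) {
            std::vector<std::string> out = expand_stmt(stmt);
            rtl.insert(rtl.end(), out.begin(), out.end());
        }
        bb.insns.swap(rtl);
    }
}
```

In the real compiler one GIMPLE statement may expand to several RTL insns, which is why `expand_stmt` returns a vector rather than a single string.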

Hope this helps,

Gr.
Steven


Re: g++ doesn't unroll a loop it should unroll

2006-12-13 Thread Benoît Jacob
Le mercredi 13 décembre 2006 23:09, Denis Vlasenko a écrit :
> C++ doesn't specify that compiler shall unroll loops, so it cannot be
> classified as "real" bug.

OK, but then, even if I explicitly ask gcc to unroll loops 
with -funroll-loops, it still doesn't unroll them completely and is still as 
slow. See bug report here:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30201

> Re code: I would use memset + just a single

No, in this example the numbers are double, but in my template library the
type is a "typename T" and I can make no assumption as to the bit
representation of static_cast<T>(0).

> loop anyway... you C++ people tend to overtax compiler with
> optimizations. Is it really necessary to do (i == j) * factor
> when (i == j) ? factor : 0 is easier for compiler to grok?

Of course I tried it. It's even slower. Doesn't help the compiler unroll the 
loop, and now there's a branch at each iteration.

> Template lib for vector and matrix math sounds like a performance
> disaster in the making, at least for me. However, maybe you are
> truly smart guy and can do miracles.

I don't understand why you say that. At the language specification level, 
templates come with no inherent speed overhead. All of the template stuff is 
unfolded at compile time, none of it remains visible in the binary, so it 
shouldn't make the binary slower.

Benoit




Re: g++ doesn't unroll a loop it should unroll

2006-12-13 Thread Steven Bosscher

On 12/14/06, Benoît Jacob <[EMAIL PROTECTED]> wrote:

> I don't understand why you say that. At the language specification level,
> templates come with no inherent speed overhead. All of the template stuff is
> unfolded at compile time, none of it remains visible in the binary, so it
> shouldn't make the binary slower.


You're confusing theory and practice...

Gr.
Steven