Re: 32 bit jump instruction.
In http://gcc.gnu.org/ml/gcc/2006-12/msg00328.html, you wrote:

>> On 06 Dec 2006 23:13:35 -0800, Ian Lance Taylor <[EMAIL PROTECTED]> wrote:
>> If you can't afford to lose a register, then I think your only option
>> is to pick some callee-saved register and have each branch instruction
>> explicitly clobber it. Then it will be available for use in a long
>> branch, and it will be available for use within a basic block. This
>> is far from ideal, but I don't know a better way to handle it within
>> gcc's current framework.
> Can i get more clarity on this part. Is it implemented in any other backends?
> When you say "pick some callee-saved register ", is it to pick them
> randomly from an available set in CALL_USED_REGISTERS or a specific
> register.

The SH does register scavenging, and sharing of far branches. Look at config/sh/sh.c:split_branches. Also see PR 29336 for how this could be better integrated with machine-specific constant pool placement.

However, because the SH has delayed branches, there is always a guaranteed way to find a register - one can be saved, and then be restored in the delay slot. An architecture without delay slots would have to have another fallback mechanism, e.g. inserting a register restore before the target - possibly with a short jump around it, duplicating instructions from the target till a register dies, or inserting a register restore and jump in the vicinity of the target.
Re: 32 bit jump instruction.
On 12/13/06, Joern Rennecke <[EMAIL PROTECTED]> wrote: In http://gcc.gnu.org/ml/gcc/2006-12/msg00328.html, you wrote: However, because the SH has delayed branches, there is always a guaranteed way to find a register - one can be saved, and then be restored in the delay slot. Heh, that's an interesting feature :-) How does that work? I always thought that the semantics of delayed insns is that the insn in the delay slot is executed *before* the branch. But that is apparently not the case, or the branch register would have been over-written before the branch. How does that work on SH? Gr. Steven
Memory allocation for local variables.
Hi all,

I tried compiling the two programs below on x86, 32 bit machines, and when I used objdump on them I saw the following code. Can anyone help me understand why, in the objdump of the first program, esp is decremented by 18H bytes, while in the second program esp is decremented by 28H bytes? How exactly is the memory allocated by gcc for local variables? Kindly help.

int main()
{
    char x;
    return 0;
}

[EMAIL PROTECTED] ~]# gcc test.c
[EMAIL PROTECTED] ~]# objdump -S a.out | less

08048348 :
 8048348: 55                   push   %ebp
 8048349: 89 e5                mov    %esp,%ebp
 804834b: 83 ec 18             sub    $0x18,%esp
 804834e: 83 e4 f0             and    $0xfffffff0,%esp
 8048351: b8 00 00 00 00       mov    $0x0,%eax
 8048356: 83 c0 0f             add    $0xf,%eax
 8048359: 83 c0 0f             add    $0xf,%eax
 804835c: c1 e8 04             shr    $0x4,%eax
 804835f: c1 e0 04             shl    $0x4,%eax
 8048362: 29 c4                sub    %eax,%esp
 8048364: b8 00 00 00 00       mov    $0x0,%eax
 8048369: c9                   leave
 804836a: c3                   ret
 804836b: 90                   nop

int main()
{
    double x,y,z;
    char p,q,r;
    return 0;
}

08048348 :
 8048348: 55                   push   %ebp
 8048349: 89 e5                mov    %esp,%ebp
 804834b: 83 ec 28             sub    $0x28,%esp
 804834e: 83 e4 f0             and    $0xfffffff0,%esp
 8048351: b8 00 00 00 00       mov    $0x0,%eax
 8048356: 83 c0 0f             add    $0xf,%eax
 8048359: 83 c0 0f             add    $0xf,%eax
 804835c: c1 e8 04             shr    $0x4,%eax
 804835f: c1 e0 04             shl    $0x4,%eax
 8048362: 29 c4                sub    %eax,%esp
 8048364: b8 00 00 00 00       mov    $0x0,%eax
 8048369: c9                   leave
 804836a: c3                   ret
 804836b: 90                   nop

--
Regards,
Sandeep

If the facts don't fit the theory, change the facts.
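The arithmetic in the two prologues can be replayed by hand. The following is a minimal sketch that mirrors the instruction sequence after `mov %esp,%ebp`; the `and` instruction is encoded as 83 e4 f0, i.e. a sign-extended mask of 0xfffffff0. The function names and the starting stack pointer used below are made up for illustration. The sketch only traces what the listed instructions compute; it does not explain how gcc chose 0x18 or 0x28 in the first place.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors: mov $0x0,%eax; add $0xf,%eax; add $0xf,%eax;
//          shr $0x4,%eax; shl $0x4,%eax
// i.e. add 30, then clear the low four bits.
std::uint32_t extra_adjustment(std::uint32_t eax) {
    eax += 0xf;
    eax += 0xf;
    eax >>= 4;
    eax <<= 4;
    return eax;
}

// Mirrors the "and" with the sign-extended mask 0xfffffff0:
// rounds the stack pointer down to a multiple of 16.
std::uint32_t align_down_16(std::uint32_t esp) {
    return esp & 0xfffffff0u;
}

// Replays the rest of the prologue for a given frame size
// (0x18 in the first listing, 0x28 in the second).
std::uint32_t esp_after_prologue(std::uint32_t esp, std::uint32_t frame_bytes) {
    assert(frame_bytes <= esp);
    esp -= frame_bytes;          // sub $0x18,%esp  or  sub $0x28,%esp
    esp = align_down_16(esp);    // and with 0xfffffff0
    esp -= extra_adjustment(0);  // sub %eax,%esp, where %eax started at 0
    return esp;
}
```

With %eax starting at 0, the add/shift sequence yields 16, so both programs subtract a further 16 bytes after aligning the stack pointer.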
g++ doesn't unroll a loop it should unroll
Hi,

I'm developing a Free C++ template library (1) in which it is very important that certain loops get unrolled, but at the same time I can't unroll them by hand, because they depend on template parameters.

My problem is that G++ 4.1.1 (Gentoo) doesn't unroll these loops.

I have written a standalone simple program showing this problem; I attach it (toto.cpp) and I also paste it below. This program does a loop if UNROLL is not defined, and does the same thing but with the loop unrolled by hand if UNROLL is defined. So one would expect that with g++ -O3, the speed would be the same in both cases. Alas, it's not:

g++ -DUNROLL -O3 toto.cpp -o toto ---> toto runs in 0.3 seconds
g++ -O3 toto.cpp -o toto ---> toto runs in 1.9 seconds

So what can I do? Is that a bug in g++? If yes, any hope to see it fixed soon?

Cheers,
Benoit

(1) : Eigen, see http://eigen.tuxfamily.org

file: toto.cpp

#include <iostream>

class Matrix
{
public:
    double data[9];
    double & operator()( int i, int j )
    {
        return data[i + 3 * j];
    }
    void loadScaling( double factor );
};

void Matrix::loadScaling( double factor)
{
#ifdef UNROLL
    (*this)( 0, 0 ) = factor;
    (*this)( 1, 0 ) = 0;
    (*this)( 2, 0 ) = 0;
    (*this)( 0, 1 ) = 0;
    (*this)( 1, 1 ) = factor;
    (*this)( 2, 1 ) = 0;
    (*this)( 0, 2 ) = 0;
    (*this)( 1, 2 ) = 0;
    (*this)( 2, 2 ) = factor;
#else
    for( int i = 0; i < 3; i++ )
        for( int j = 0; j < 3; j++ )
            (*this)(i, j) = (i == j) * factor;
#endif
}

int main( int argc, char *argv[] )
{
    Matrix m;
    for( int i = 0; i < 1; i++ )
        m.loadScaling( i );
    std::cout << "m(0,0) = " << m(0,0) << std::endl;
}
Re: 32 bit jump instruction.
Quoting Steven Bosscher <[EMAIL PROTECTED]>:

> On 12/13/06, Joern Rennecke <[EMAIL PROTECTED]> wrote:
> > In http://gcc.gnu.org/ml/gcc/2006-12/msg00328.html, you wrote:
> > However, because the SH has delayed branches, there is always a guaranteed way
> > to find a register - one can be saved, and then be restored in the delay slot.
>
> Heh, that's an interesting feature :-)
>
> How does that work? I always thought that the semantics of delayed
> insns is that the insn in the delay slot is executed *before* the
> branch. But that is apparently not the case, or the branch register
> would have been over-written before the branch. How does that work on
> SH?

The jump address is calculated, then the delay slot instruction is executed - or sometimes, if the instructions are pairable, the delay slot insn is executed simultaneously with the jump address calculation. Then - or during the delay slot insn execution - the target instruction is fetched, and then executed.

You can look into sim/sh/interp.c for a functional model of how this works from the programmer's point of view.
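The ordering described in this message can be shown with a toy model. This is only a sketch under stated assumptions: `Cpu`, `delayed_jump` and `far_branch` are hypothetical names, the register file is simplified, and nothing here is taken from sim/sh/interp.c. It shows why a scratch register holding the jump target can be restored in the delay slot: the target address is read before the delay slot instruction runs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>

// Toy machine state: 16 general registers and a program counter.
struct Cpu {
    std::array<std::uint32_t, 16> r{};
    std::uint32_t pc = 0;
};

// Models a register-indirect jump followed by one delay-slot instruction.
// Order: (1) read the target, (2) run the delay slot, (3) transfer control.
void delayed_jump(Cpu& cpu, int target_reg,
                  const std::function<void(Cpu&)>& delay_slot) {
    assert(target_reg >= 0 && target_reg < 16);
    const std::uint32_t target = cpu.r[target_reg];  // address read first
    delay_slot(cpu);                                 // may overwrite target_reg
    cpu.pc = target;                                 // jump uses the value read earlier
}

// Far branch through a scratch register that is live across the branch:
// save it, load the target into it, jump, and restore it in the delay slot.
std::uint32_t far_branch(Cpu& cpu, int scratch, std::uint32_t far_target) {
    const std::uint32_t saved = cpu.r[scratch];      // stands in for a stack save
    cpu.r[scratch] = far_target;
    delayed_jump(cpu, scratch,
                 [saved, scratch](Cpu& c) { c.r[scratch] = saved; });
    return cpu.pc;
}
```

After `far_branch` returns, the program counter holds the far target and the scratch register holds its original value again, which is the guarantee the message relies on.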
Re: g++ doesn't unroll a loop it should unroll
On 12/13/06, Benoît Jacob <[EMAIL PROTECTED]> wrote: g++ -DUNROLL -O3 toto.cpp -o toto ---> toto runs in 0.3 seconds g++ -O3 toto.cpp -o toto---> toto runs in 1.9 seconds So what can I do? Is that a bug in g++? If yes, any hope to see it fixed soon? You could try adding -funroll-loops. Gr. Steven
Re: Memory allocation for local variables.
On 12/13/06, Sandeep Kumar <[EMAIL PROTECTED]> wrote: Hi all, I tried compiling the above two programs : on x86, 32 bit machines. [EMAIL PROTECTED] ~]# gcc test.c Try with optimization enabled (try -O1 and/or -O2). Gr. Steven
Re: g++ doesn't unroll a loop it should unroll
I had already tried that. That doesn't change anything. I had also tried passing a higher --param max-unroll-times. No effect.

So, any idea? The example program toto.cpp is so simple, I can't believe g++ can't handle it. Surely there must be something simple that I haven't understood?

Benoit

On Wednesday 13 December 2006 13:12, Steven Bosscher wrote:
> On 12/13/06, Benoît Jacob <[EMAIL PROTECTED]> wrote:
> > g++ -DUNROLL -O3 toto.cpp -o toto ---> toto runs in 0.3 seconds
> > g++ -O3 toto.cpp -o toto ---> toto runs in 1.9 seconds
> >
> > So what can I do? Is that a bug in g++? If yes, any hope to see it fixed
> > soon?
>
> You could try adding -funroll-loops.
>
> Gr.
> Steven
Re: libffi compilation failure on Solaris 10?
begin quoting Eric Botcazou as of Thu, Nov 30, 2006 at 10:05:21PM +0100: [snip] > ... if the user is trying to link objects files assembled by the GNU assembler > using the Sun linker. ...which seems to be the case. Even when configure is told to use gld, somewhere down the line, gcc eventually decides that it should use /usr/ccs/bin/ld instead. http://www.stremler.net/temp/GST/try3.txt I see that 2.3.1 is out, but I'm wondering if I need to start with getting a later version of GCC and friends. -- Stewart Stremler
Re: g++ doesn't unroll a loop it should unroll
Benoît Jacob <[EMAIL PROTECTED]> writes:

> I'm developing a Free C++ template library (1) in which it is very important
> that certain loops get unrolled, but at the same time I can't unroll them by
> hand, because they depend on template parameters.
>
> My problem is that G++ 4.1.1 (Gentoo) doesn't unroll these loops.
>
> I have written a standalone simple program showing this problem; I attach it
> (toto.cpp) and I also paste it below. This program does a loop if UNROLL is
> not defined, and does the same thing but with the loop unrolled by hand if
> UNROLL is defined. So one would expect that with g++ -O3, the speed would be
> the same in both cases. Alas, it's not:

When I try it, gcc does unroll the loops. It completely unrolls the inner loop, but only partially unrolls the outer loop.

The reason it doesn't completely unroll the outer loop is simply that gcc doesn't attempt to completely unroll loops which contain inner loops.

This could probably be fixed: we could probably completely unroll a loop if all its inner loops were completely unrolled. I encourage you to file a bug report. See http://gcc.gnu.org/bugs.html.

Ian
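Since the limitation described here concerns loops that contain inner loops, one possible source-level workaround is to write the 3x3 fill as a single loop with a constant trip count. This is a minimal sketch, not a tested fix: `loadScalingFlat` is a hypothetical name, and whether a particular GCC version fully unrolls this loop is an assumption that would need checking in the generated code.

```cpp
#include <cassert>

struct Matrix {
    double data[9];

    double& operator()(int i, int j) {
        assert(i >= 0 && i < 3 && j >= 0 && j < 3);
        return data[i + 3 * j];
    }

    // Same result as the nested loops in toto.cpp, written as one loop
    // over the 9 storage slots. Slot k holds element (i, j) with
    // i = k % 3 and j = k / 3, so the diagonal is where k % 3 == k / 3.
    void loadScalingFlat(double factor) {
        for (int k = 0; k < 9; ++k)
            data[k] = (k % 3 == k / 3) * factor;
    }
};
```

The loop has no inner loop, so it is at least a candidate for complete unrolling. It produces the same matrix as the hand-unrolled version.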
Re: g++ doesn't unroll a loop it should unroll
On Wed, 13 Dec 2006, Ian Lance Taylor wrote:

> When I try it, gcc does unroll the loops. It completely unrolls the inner loop, but only partially unrolls the outer loop.
>
> The reason it doesn't completely unroll the outer loop is simply that gcc doesn't attempt to completely unroll loops which contain inner loops.

OK, I don't have the skill to check the binary code. All I know is that with UNROLL defined it runs more than 5x faster, so there is room for improvement :)

> This could probably be fixed: we could probably completely unroll a loop if all its inner loops were completely unrolled. I encourage you to file a bug report. See http://gcc.gnu.org/bugs.html.

OK: I didn't dare decide on my own that it's a bug, but if you say so... I'll file a bug report now.

Benoit
Re: Serious SPEC CPU 2006 FP performance regressions on IA32
Meissner, Michael wrote:
>>> 437.leslie3d -26%
> it was felt that the PPRE patches that were added on November 13th were
> the cause of the slowdown:
> http://gcc.gnu.org/ml/gcc/2006-12/msg00023.html
>
> Has anybody tried doing a run with just ppre disabled?

Right. PPRE appears to be the reason for the slowdown.

-fno-tree-pre gets performance of cpu2006/437.leslie3d back to normal. This is the worst case. Verifying the whole set of cpu2006 benchmarks will take much longer.

- Grigory
Re: Serious SPEC CPU 2006 FP performance regressions on IA32
On 12/13/06, Grigory Zagorodnev <[EMAIL PROTECTED]> wrote: Meissner, Michael wrote: >>> 437.leslie3d-26% > it was felt that the PPRE patches that were added on November 13th were > the cause of the slowdown: > http://gcc.gnu.org/ml/gcc/2006-12/msg00023.html > > Has anybody tried doing a run with just ppre disabled? > Right. PPRE appears to be the reason of slowdown. -fno-tree-pre gets performance of cpu2006/437.leslie3d back to normal. This is the worst case. And that will take much longer to verify whole set of cpu2006 benchmarks. It would be sooo nice to have a (small) testcase that shows why we are regressing. Thanks! Richard.
RE: Serious SPEC CPU 2006 FP performance regressions on IA32
> Meissner, Michael wrote:
> >>> 437.leslie3d -26%
> > it was felt that the PPRE patches that were added on November 13th were
> > the cause of the slowdown:
> > http://gcc.gnu.org/ml/gcc/2006-12/msg00023.html
> >
> > Has anybody tried doing a run with just ppre disabled?
>
> Right. PPRE appears to be the reason of slowdown.
>
> -fno-tree-pre gets performance of cpu2006/437.leslie3d back to normal.
> This is the worst case. And that will take much longer to verify whole
> set of cpu2006 benchmarks.

If -fno-tree-pre disables PPRE, then it doesn't change much (4.3 relative to 4.2 with -O2):

CPU2006          -O2    -O2 -fno-tree-pre
410.bwaves       -6%    -8%
416.gamess
433.milc
434.zeusmp
435.gromacs
436.cactusADM
437.leslie3d     -26%   -27%
444.namd
447.dealII
450.soplex
453.povray
454.calculix
459.GemsFDTD     -12%   -12%
465.tonto
470.lbm
481.wrf
482.sphinx3

Is PPRE enabled at -O2 at all? I couldn't confirm that from the original patches, which enabled PPRE only at -O3.

--
___
Evandro Menezes
AMD
Austin, TX
Re: Serious SPEC CPU 2006 FP performance regressions on IA32
On 12/13/06, Menezes, Evandro <[EMAIL PROTECTED]> wrote:

> > Meissner, Michael wrote:
> > >>> 437.leslie3d -26%
> > > it was felt that the PPRE patches that were added on November 13th were
> > > the cause of the slowdown:
> > > http://gcc.gnu.org/ml/gcc/2006-12/msg00023.html
> > >
> > > Has anybody tried doing a run with just ppre disabled?
> >
> > Right. PPRE appears to be the reason of slowdown.
> >
> > -fno-tree-pre gets performance of cpu2006/437.leslie3d back to normal.
> > This is the worst case. And that will take much longer to verify whole
> > set of cpu2006 benchmarks.
>
> If -fno-tree-pre disables PPRE, then it doesn't change much (4.3 relative to 4.2 with -O2):
>
> CPU2006          -O2    -O2 -fno-tree-pre
> 410.bwaves       -6%    -8%
> 416.gamess
> 433.milc
> 434.zeusmp
> 435.gromacs
> 436.cactusADM
> 437.leslie3d     -26%   -27%
> 444.namd
> 447.dealII
> 450.soplex
> 453.povray
> 454.calculix
> 459.GemsFDTD     -12%   -12%
> 465.tonto
> 470.lbm
> 481.wrf
> 482.sphinx3
>
> Is PPRE enabled at -O2 at all? I couldn't confirm that from the original patches, which enabled PPRE only at -O3.

PPRE is only enabled at -O3.

Richard.
Re: libffi compilation failure on Solaris 10?
> ...which seems to be the case. Even when configure is told to use gld,
> somewhere down the line, gcc eventually decides that it should use
> /usr/ccs/bin/ld instead.

You need to rebuild the whole compiler if you want to switch from the Sun tools to GNU binutils, i.e. --with-gnu-ld should be passed on the configure line of GCC itself and the compiler entirely rebuilt.

--
Eric Botcazou
Re: Unwinding CFI gcc practice of assumed `same value' regs
Hi,

On Tue, 12 Dec 2006, Andrew Haley wrote:

> > > In practice, %ebp either points to a call frame -- not necessarily
> > > the most recent one -- or is null. I don't think that having an
> > > optional frame pointer means you can use %ebp for anything random at
> > > all, but we need to make a clarification request of the ABI.
> >
> > I don't see that as feasible. If %ebp/%rbp may be used as a general
> > callee-saved register, then it can hold any value.
>
> Sure, we already know that, as has been clear. The question is *if*
> %rbp may be used as a general callee-saved register that can hold any
> value.

Yes, of course it was meant to be used as such. The ABI actually only gives a recommendation that %rbp should be zero in the outermost frame; it's not a must. The ABI _requires_ proper .eh_frame descriptors when unwinding is desired, so it's useless (and wrong) for any unwinder to look at %rbp to determine whether it should stop.

Alternatively (though not sanctioned by the ABI) all functions through which unwinding is desired but for which no unwind info is created _have_ to use %rbp as frame pointer and not as a general register. In that case the zeroing of %rbp would be a usable stop condition for functions without unwind info. But that's already outside the ABI.

Ciao,
Michael.
Re: Unwinding CFI gcc practice of assumed `same value' regs
Hi,

On Mon, 11 Dec 2006, Jan Kratochvil wrote:

> currently (on x86_64) the gdb backtrace does not properly stop at the
> outermost frame:
>
> #3 0x0036ddb0610a in start_thread () from /lib64/tls/libpthread.so.0
> #4 0x0036dd0c68c3 in clone () from /lib64/tls/libc.so.6
> #5 0x in ?? ()
>
> Currently it relies only on clearing %rbp (0x above is
> unrelated to it, it got read from uninitialized memory).
>
> http://sourceware.org/ml/gdb/2004-08/msg00060.html suggests frame
> pointer 0x0 should be enough for a debugger not finding CFI to stop
> unwinding, still it is a heuristic. In the -fno-frame-pointer compiled
> code there is no indication the frame pointer register became a regular
> one and 0x0 is its valid value.

Right. Unwinding through functions (without frame pointer) requires CFI. If there is CFI for a function, the unwinder must not look at %rbp for a stop condition. If there's no CFI for a function, it can't be unwound (strictly per ABI). If one relaxes that and wants to unwind through CFI-less functions, such a function has to have a frame pointer. In that case zero in that frame pointer could indicate the outermost frame (_if_ the suggestion in the ABI is adhered to, which no one is required to do).

Ciao,
Michael.
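The decision rule laid out in this message can be written down as a small toy model. This is a sketch under stated assumptions, not gdb or libgcc code: `Frame`, `Step` and `decide` are hypothetical names, and the flags are simplified stand-ins for what a real unwinder would derive from .eh_frame data. In particular, `cfi_marks_outermost` is an invented placeholder for whatever end-of-stack indication the unwind info carries, and `uses_frame_pointer` stands for a convention the unwinder has to assume, since without CFI it cannot verify it.

```cpp
#include <cassert>

// Simplified description of one frame as seen by an unwinder.
struct Frame {
    bool has_cfi;                 // unwind info exists for this function
    bool cfi_marks_outermost;     // hypothetical: the CFI says there is no caller
    bool uses_frame_pointer;      // assumed convention for CFI-less code
    unsigned long frame_pointer;  // value found in %rbp for this frame
};

enum class Step { Continue, StopOutermost, StopCannotUnwind };

// With CFI, %rbp is never consulted: it may be an ordinary callee-saved
// register holding any value, including zero. Without CFI, unwinding is
// only possible under the relaxed frame-pointer convention, and a zero
// frame pointer is then taken as the outermost-frame marker.
Step decide(const Frame& f) {
    if (f.has_cfi)
        return f.cfi_marks_outermost ? Step::StopOutermost : Step::Continue;
    if (!f.uses_frame_pointer)
        return Step::StopCannotUnwind;
    return f.frame_pointer == 0 ? Step::StopOutermost : Step::Continue;
}
```

The first branch is the important one: a frame with CFI and a zero %rbp keeps unwinding, because the zero carries no meaning there.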
Re: Unwinding CFI gcc practice of assumed `same value' regs
Hi, On Tue, 12 Dec 2006, Ulrich Drepper wrote: > > Really? Well, that's one interpretation. I don't believe that, > > though. It's certainly an inconsistency in the specification, which > > says that null-termination is supported, and this implies that you > > can't put a zero in there. > > Again, this is just because the "authors" of the ABI didn't think. [Blaeh, Ulrich talk] No, I think it's because the "readers" of the ABI can't read. Ciao, Michael.
why no boehm-gc tests?
I noticed that boehm-gc check doesn't work from within the dejagnu framework. According to the notes in PR11412, this was going to be fixable once the multi-lib stuff was moved to the top level. I assume this has happened by now so can we fix this for gcc 4.2? Jack
Re: why no boehm-gc tests?
Jack Howarth wrote:

> I noticed that boehm-gc check doesn't work from within the dejagnu framework. According to the notes in PR11412, this was going to be fixable once the multi-lib stuff was moved to the top level. I assume this has happened by now so can we fix this for gcc 4.2?
> Jack

You should probably target the trunk first. Then, after the patch is proven there, a backport could be considered under the branch commit criteria.

David Daney
[PATCH] Re: Unwinding CFI gcc practice of assumed `same value' regs
Hi,

On Tue, 12 Dec 2006 16:52:33 +0100, Jakub Jelinek wrote:
...
> Here is something that would handle by default same_value retaddr_column:
[ http://sources.redhat.com/ml/gdb/2006-12/msg00100.html ]

Thanks for this backward compatible glibc unwinder patch. I would like to see it accepted as a step in preparing the environment to use `.cfi_undefined PC' sometime in the future.

Attaching a patch for current glibc CVS which removes the `.cfi_undefined PC' unwinder handling requirement but which provides an explicit return address 0 from the `__clone' function. Currently the 0 is already present there but it is an uninitialized value out of some TLS or 'struct pthread' area (did not check).

Attaching a patch for current gdb CVS to properly terminate on return address 0. The check was already present there but it got applied one backward step later.

I hope these three patches are 100% reliable and also backward compatible.

Regards,
Jan

2006-12-13  Jan Kratochvil  <[EMAIL PROTECTED]>

    * sysdeps/unix/sysv/linux/i386/clone.S: CFI `clone' unwinding
    outermost frame indicator replaced by more unwinders compatible
    termination indication of `PC == 0'.
    * sysdeps/unix/sysv/linux/x86_64/clone.S: Likewise.

--- libc/sysdeps/unix/sysv/linux/i386/clone.S 3 Dec 2006 23:12:36 - 1.27
+++ libc/sysdeps/unix/sysv/linux/i386/clone.S 13 Dec 2006 11:20:55 -
@@ -68,6 +68,8 @@ ENTRY (BP_SYM (__clone))
    thread is started with an alignment of (mod 16). */
  andl $0xfffffff0, %ecx
  subl $28,%ecx
+ /* Terminate the stack frame by pretended return address 0. */
+ movl $0,16(%ecx)
  movl ARG(%esp),%eax /* no negative argument counts */
  movl %eax,12(%ecx)
@@ -121,10 +123,15 @@ L(pseudo_end):
 L(thread_start):
  cfi_startproc;
- /* Clearing frame pointer is insufficient, use CFI. */
- cfi_undefined (eip);
- /* Note: %esi is zero. */
- movl %esi,%ebp /* terminate the stack frame */
+ /* This CFI recommended way of unwindable function is incompatible
+    across unwinders incl. the libgcc_s one.
+ cfi_undefined (eip);
+ */
+ /* Frame pointer 0 was considered as the stack frame termination
+    before but it is no longer valid for -fomit-frame-pointer code.
+    Still keep the backward compatibility and clear the register.
+    Note: %esi is zero. */
+ movl %esi,%ebp
 #ifdef RESET_PID
  testl $CLONE_THREAD, %edi
  je L(newpid)
--- libc/sysdeps/unix/sysv/linux/x86_64/clone.S 3 Dec 2006 23:12:36 - 1.7
+++ libc/sysdeps/unix/sysv/linux/x86_64/clone.S 13 Dec 2006 11:20:55 -
@@ -61,8 +61,12 @@ ENTRY (BP_SYM (__clone))
  testq %rsi,%rsi /* no NULL stack pointers */
  jz SYSCALL_ERROR_LABEL
+ /* Prepare the data located at %rsp after `syscall' below.
+    Used only 3*8 bytes but the stack is 16 bytes aligned. */
+ subq $32,%rsi
+ /* Terminate the stack frame by pretended return address 0. */
+ movq $0,16(%rsi)
  /* Insert the argument onto the new stack. */
- subq $16,%rsi
  movq %rcx,8(%rsi)
  /* Save the function pointer. It will be popped off in the
@@ -90,10 +94,15 @@ L(pseudo_end):
 L(thread_start):
  cfi_startproc;
- /* Clearing frame pointer is insufficient, use CFI. */
- cfi_undefined (rip);
- /* Clear the frame pointer. The ABI suggests this be done, to mark
-    the outermost frame obviously. */
+ /* This CFI recommended way of unwindable function is incompatible
+    across unwinders incl. the libgcc_s one.
+ cfi_undefined (rip);
+ */
+ /* Frame pointer 0 was considered as the stack frame termination
+    before but it is no longer valid for -fomit-frame-pointer code.
+    Still keep the backward compatibility and clear the register,
+    the ABI suggests this be done, to mark the outermost frame
+    obviously. */
  xorl %ebp, %ebp
 #ifdef RESET_PID

2006-12-13  Jan Kratochvil  <[EMAIL PROTECTED]>

    * gdb/frame.c (get_prev_frame): Already the first `PC == 0' stack
    frame is declared invalid, not the second one as before.

2006-12-13  Jan Kratochvil  <[EMAIL PROTECTED]>

    * gdb.threads/bt-clone-stop.exp, gdb.threads/bt-clone-stop.c:
    Backtraced `clone' must not have `PC == 0' as its previous frame.

--- ./gdb/frame.c 10 Nov 2006 20:11:35 - 1.215
+++ ./gdb/frame.c 13 Dec 2006 19:06:19 - 
@@ -1390,19 +1390,28 @@ get_prev_frame (struct frame_info *this_
  return NULL;
  }
+ prev_frame = get_prev_frame_1 (this_frame);
+ if (!prev_frame)
+   return NULL;
+
  /* Assume that the only way to get a zero PC is through something
     like a SIGSEGV or a dummy frame, and hence that NORMAL frames
- will never unwi
Back End Responsibilities + RTL Generation
Hi!

I need some clarification concerning the responsibilities of the Middle End and the Back End, the generation of the control flow graph (CFG), and RTL. I have read several articles about the internal structure of GCC and looked at the internals documentation. However, I would like to have a second opinion about this matter.

One of my professors stated that a GCC Back End uses the Control Flow Graph as its input and that generation of RTL expressions occurs later on. What roles do the Back End and the Middle End play in the generation of RTL? Would you consider the CFG or RTL expressions as the input for a GCC Back End?

I also remember having read the following line in the gcc internals documentation. However, I'm still not sure how to interpret this:

"A control flow graph (CFG) is a data structure built on top of the intermediate code representation (the RTL or tree instruction stream) abstracting the control flow behavior of a function that is being compiled"

Does that mean that a control flow graph is built after RTL has been generated, or that information about the control flow is incorporated into the RTL data structures? Could somebody clarify this, please?

Cheers,
Frank
Re: g++ doesn't unroll a loop it should unroll
Disclaimer: I am not a compiler developer.

On Wednesday 13 December 2006 12:44, Benoît Jacob wrote:
> I'm developing a Free C++ template library (1) in which it is very important
> that certain loops get unrolled, but at the same time I can't unroll them by
> hand, because they depend on template parameters.
>
> My problem is that G++ 4.1.1 (Gentoo) doesn't unroll these loops.
>
> I have written a standalone simple program showing this problem; I attach it
> (toto.cpp) and I also paste it below. This program does a loop if UNROLL is
> not defined, and does the same thing but with the loop unrolled by hand if
> UNROLL is defined. So one would expect that with g++ -O3, the speed would be
> the same in both cases. Alas, it's not:
>
> g++ -DUNROLL -O3 toto.cpp -o toto ---> toto runs in 0.3 seconds
> g++ -O3 toto.cpp -o toto ---> toto runs in 1.9 seconds
>
> So what can I do? Is that a bug in g++?

C++ doesn't specify that the compiler shall unroll loops, so it cannot be classified as a "real" bug.

# g++ -c -O3 toto.cpp -o toto.o
# g++ -DUNROLL -O3 toto.cpp -o toto_unroll.o -c
# size toto.o toto_unroll.o
   text    data     bss     dec     hex filename
    525       8       1     534     216 toto.o
    359       8       1     368     170 toto_unroll.o

How can a C++ compiler know that you are willing to trade so much of text size for performance? I usually find myself on the opposite side: I use -Os but gcc still eats more space in the name of speed in certain situations.

Re code: I would use memset + just a single, non-nested for() loop anyway... you C++ people tend to overtax the compiler with optimizations. Is it really necessary to do (i == j) * factor when (i == j) ? factor : 0 is easier for the compiler to grok?

> If yes, any hope to see it fixed soon?
>
> Cheers,
> Benoit
>
> (1) : Eigen, see http://eigen.tuxfamily.org

"Eigen is a lightweight C++ template library for vector and matrix math, a.k.a. linear algebra."

Template lib for vector and matrix math sounds like a performance disaster in the making, at least for me. However, maybe you are truly smart guy and can do miracles.

Cheers,
--
vda
Eric Christopher appointed Darwin maintainer
I am pleased to announce that the GCC Steering Committee has appointed Eric Christopher as Darwin co-maintainer. Please join me in congratulating Eric on his new role. Eric, please update your listings in the MAINTAINERS file. Happy hacking! David
Re: Back End Responsibilities + RTL Generation
On 12/13/06, Frank Riese <[EMAIL PROTECTED]> wrote:

> One of my professors stated that a GCC Back End uses the Control Flow Graph as its input and that generation of RTL expressions occurs later on.

That is not true.

> What roles do Back and Middle End play in generation of RTL? Would you consider the CFG or RTL expressions as the input for a GCC Back End?

Let me first say that the definitions of front end, back end, and middle end are a bit hairy. You have to carefully define what you classify as belonging to the middle end or the back end. I actually try to avoid the terms nowadays.

Also, you have to be specific about the version of GCC that you're talking about. GCC2, GCC3 and GCC4 are completely different internally, and even the differences between various GCC4 releases are quite significant.

Anyway... The steps through the compiler are as follows:

1. front end runs, produces GENERIC
2. GENERIC is lowered to GIMPLE
3. a CFG is constructed for GIMPLE
4. GIMPLE (tree-ssa) optimizers run
5. GIMPLE is expanded to RTL, while preserving the CFG
6. RTL optimizers run
7. assembly is written out

The RTL generation in step 5 is done one statement at a time. The part of the compiler that generates the RTL is a mix of shared code and of back end code: a single GIMPLE statement at a time is passed to the middle-end expand routines, which try to produce RTL for this statement using instructions available on the target machine. The available instructions are defined by the target machine description (i.e. the back end). Try to understand cfgexpand.c and the section on named RTL patterns in the GCC internals manual.

> I also remembered having read the following line from the gcc internals documentation. However, I'm still not sure how to interpret this:
>
> "A control flow graph (CFG) is a data structure built on top of the intermediate code representation (the RTL or tree instruction stream) abstracting the control flow behavior of a function that is being compiled"
>
> Does that mean that a control flow graph is built after rtl has been generated or that information about the control flow is incorporated into the RTL data structures?

Neither. I'm assuming you're interested in how this works in recent GCC releases, i.e. GCC4 based.

In GCC4, the control flow graph is built on GIMPLE, because the tree-ssa optimizers need a CFG too. This CFG is kept up-to-date through the optimizers and through expansion to RTL. This means that GCC builds the CFG only once for each function.

The data structures for the CFG are in basic-block.h. These data structures are most definitely *not* incorporated into the RTL structures. The CFG is independent of the intermediate representations for the function instructions. It has to be, or you couldn't have the same CFG data structures for both GIMPLE and RTL.

Hope this helps,

Gr.
Steven
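The point that the CFG lives apart from the instruction representation, and survives expansion, can be illustrated with a toy data structure. This is only a sketch: `GimpleStmt`, `RtlInsn`, `BasicBlock`, `Cfg` and `expand_to_rtl` are hypothetical stand-ins invented for illustration and bear no relation to the real declarations in basic-block.h or to the real expand code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-ins for the two instruction representations.
struct GimpleStmt { std::string text; };
struct RtlInsn    { std::string text; };

// A basic block owns its edges independently of which representation
// currently holds its instructions.
struct BasicBlock {
    std::vector<std::size_t> succs;  // indices of successor blocks
    std::vector<GimpleStmt> gimple;  // filled before expansion
    std::vector<RtlInsn> rtl;        // filled by expansion
};

struct Cfg {
    std::vector<BasicBlock> blocks;
};

// "Expansion" in this toy: each statement is turned into one instruction,
// one statement at a time, block by block. The edges are never touched,
// so the graph built on the first representation is still valid afterwards.
void expand_to_rtl(Cfg& cfg) {
    for (BasicBlock& bb : cfg.blocks) {
        for (const GimpleStmt& s : bb.gimple)
            bb.rtl.push_back(RtlInsn{"(insn " + s.text + ")"});
        bb.gimple.clear();
    }
    assert(!cfg.blocks.empty() || true);  // nothing to check for an empty graph
}
```

The design choice to note is that the successor lists are plain block indices, not pointers into statements or instructions, which is what lets one graph serve both representations.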
Re: g++ doesn't unroll a loop it should unroll
On Wednesday 13 December 2006 23:09, Denis Vlasenko wrote:
> C++ doesn't specify that compiler shall unroll loops, so it cannot be
> classified as "real" bug.

OK, but then, even if I explicitly ask gcc to unroll loops with -funroll-loops, it still doesn't unroll them completely and is still as slow. See the bug report here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30201

> Re code: I would use memset + just a single

No. In this example the numbers are double, but in my template library the type is a "typename T" and I can make no assumption as to the bit representation of static_cast<T>(0).

> loop anyway... you C++ people tend to overtax compiler with
> optimizations. Is it really necessary to do (i == j) * factor
> when (i == j) ? factor : 0 is easier for compiler to grok?

Of course I tried it. It's even slower. It doesn't help the compiler unroll the loop, and now there's a branch at each iteration.

> Template lib for vector and matrix math sounds like a performance
> disaster in the making, at least for me. However, maybe you are
> truly smart guy and can do miracles.

I don't understand why you say that. At the language specification level, templates come with no inherent speed overhead. All of the template stuff is unfolded at compile time, none of it remains visible in the binary, so it shouldn't make the binary slower.

Benoit
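For a generic element type, where memset is not an option, one commonly used alternative to relying on the loop unroller is to unroll through template recursion, so that the indices are compile-time constants. This is a minimal sketch and not Eigen's actual code: `ScalingUnroller` and this `Matrix` template are hypothetical names, and the sketch assumes that `T(0)` is a valid zero for the element type. Whether the compiler inlines the whole chain of calls is still up to the optimizer.

```cpp
#include <cassert>

// Writes storage slot K, then continues with slot K + 1. The recursion is
// resolved at compile time and ends at the specialization with Done == true.
template <typename T, int N, int K = 0, bool Done = (K == N * N)>
struct ScalingUnroller {
    static void run(T* data, const T& factor) {
        const int i = K % N;  // both indices are compile-time constants,
        const int j = K / N;  // so the comparison below can be folded away
        data[i + N * j] = (i == j) ? factor : T(0);
        ScalingUnroller<T, N, K + 1>::run(data, factor);
    }
};

// End of recursion: all N * N slots have been written.
template <typename T, int N, int K>
struct ScalingUnroller<T, N, K, true> {
    static void run(T*, const T&) {}
};

template <typename T, int N>
struct Matrix {
    T data[N * N];

    T& operator()(int i, int j) {
        assert(i >= 0 && i < N && j >= 0 && j < N);
        return data[i + N * j];
    }

    void loadScaling(const T& factor) {
        ScalingUnroller<T, N>::run(data, factor);
    }
};
```

Because `i` and `j` are constants in each instantiation, the conditional form of the assignment does not introduce a run-time branch per element, unlike the same expression inside a run-time loop.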
Re: g++ doesn't unroll a loop it should unroll
On 12/14/06, Benoît Jacob <[EMAIL PROTECTED]> wrote: I don't understand why you say that. At the language specification level, templates come with no inherent speed overhead. All of the template stuff is unfolded at compile time, none of it remains visible in the binary, so it shouldn't make the binary slower. You're confusing theory and practice... Gr. Steven