I have occasionally examined asm generated by gcc on my amd64, and it appears gcc is not optimising recursive functions very well, though I'm not sure.
I find in code like: int f(int x) { ... f(y) ... } that gcc generates an extern ABI compliant function f, which is recursively called where shown, possibly first unrolling the recursion. This is a poor tactic. The right way to do this is: int f(int x) { int g(int x) { ... g(y) ... } g(x); } where the 'inner' function does not have to be ABI compliant. This code (generated by Felix): int ack( int x, int y){ start_5778:; if(!(x == 0 ==1) ) goto _5773; return y + 1 ; _5773:; if(!(y == 0 ==1) ) goto _5774; y = 1; x = x - 1 ; goto start_5778; _5774:; y = ack(x, y - 1 ); x = x - 1 ; goto start_5778; } is almost twice as fast as this one on AMD64 with -O3: int Ack(int M, int N) { if (M==0) return N +1; else if(N==0) return Ack(M-1,1); else return Ack(M-1, Ack(M,N-1)); } but basic SSA analysis should make these equivalent. The performance of this function is *entirely* determined by the number of words stacked per function call. The C case is correctly tail-rec optimised by gcc. I also note gcc unrolls the nontail recursion in both cases (three copies for the Felix version). I am quite puzzled by the massive performance improvement Felix can get by minor simplifications. There is one 'sophisticated' optimisation in the Felix version: the tail calls use parallel assignment: _5774:; y = ack(x, y - 1 ); x = x - 1 ; goto start_5778; The ordering of calculation of x and y here is no accident and is at least 25% faster than the other ordering, since that would require a temporary to hold the old x (the function call requires 3 words on the stack, the temporary would be a fourth). Because gcc 'unrolls' the recursion but then calls the ABI compliant top level function recursively, it is probably making the call ABI compliant and stacking caller save registers at the call site, and stacking and unstacking callee save registers at the call and return sites. So I am guessing the Felix version is lucky there are no gratuitous temporaries to be saved when this happens, and the C code is unlucky and there are. Maybe someone who knows how the optimiser works can comment? -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net