I have occasionally examined asm generated by gcc on my amd64,
and it appears gcc is not optimising recursive functions very
well, though I'm not sure.

I find in code like:

        int f(int x) { ... f(y) ... }

that gcc generates an extern ABI compliant function f, 
which is recursively called where shown, possibly first
unrolling the recursion.

This is a poor tactic. The right way to do this is:

        int f(int x) {
                int g(int x) { ... g(y) ... }
                g(x);
        }

where the 'inner' function does not have to be ABI
compliant. This code (generated by Felix):

int ack( int x, int y){
    start_5778:;
      if(!(x == 0 ==1) ) goto _5773;
      return y + 1 ;
    _5773:;
      if(!(y == 0 ==1) ) goto _5774;
      y = 1;
      x = x - 1 ;
      goto start_5778;
    _5774:;
      y = ack(x, y - 1 );
      x = x - 1 ;
      goto start_5778;
}

is almost twice as fast as this one on AMD64 with -O3:

int Ack(int M, int N) {
  if (M==0) return N +1;
  else if(N==0) return Ack(M-1,1);
  else return Ack(M-1, Ack(M,N-1));
}

but basic SSA analysis should make these equivalent.
The performance of this function is *entirely* determined
by the number of words stacked per function call.

The C case is correctly tail-rec optimised by gcc.
I also note gcc unrolls the nontail recursion in both
cases (three copies for the Felix version).

I am quite puzzled by the massive performance improvement
Felix can get by minor simplifications.

There is one 'sophisticated' optimisation in the Felix
version: the tail calls use parallel assignment:

    _5774:;
      y = ack(x, y - 1 );
      x = x - 1 ;
      goto start_5778;

The ordering of calculation of x and y here is no accident
and is at least 25% faster than the other ordering, since
that would require a temporary to hold the old x
(the function call requires 3 words on the stack, the temporary
would be a fourth).

Because gcc 'unrolls' the recursion but then calls the
ABI compliant top level function recursively, it is
probably making the call ABI compliant and stacking
caller save registers at the call site, and stacking
and unstacking callee save registers at the call and 
return sites.

So I am guessing the Felix version is lucky there are
no gratuitous temporaries to be saved when this happens,
and the C code is unlucky and there are.

Maybe someone who knows how the optimiser works can comment?



-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

Reply via email to