------- Comment #5 from adam at consulting dot net dot nz 2010-09-13 00:24 -------
Andrew Pinski wrote:
> > This is caused by revision 160124:
> Not really, it is a noreturn function so the behavior is correct for our
> policy of allowing a more correct backtrace for noreturn functions.

I'm not sure what you're trying to say here, Andrew. Are you trying to
justify -O3 generating slower code to simplify debugging?

> The testcase is a incorrect one based on size

If you mean the zero-extension of 32-bit function pointers, that is the
x86-64 small code model. If you mean that you don't care that the testcase
increased in size without further benchmarking, then empirical analysis is
actually unnecessary: the generated assembly is clearly worse.

> and not really that interesting anymore with respect of global register
> variables.

It's another example of global register variables being copied for no good
reason whatsoever. RAX is free, and the obvious translation of

  uint32_t next = Iptr[1];

to x86-64 assembly is (Intel syntax, where RBP is the global register
variable):

  mov eax,DWORD PTR [rbp+0x4]

Generating

  mov rax,rbp
  mov eax,DWORD PTR [rax+0x4]

is just dumb.

I've been experimenting with optimal forms of virtual machine dispatch for a
long time, and what you have here is a fragment of a very fast direct
threaded interpreter. So fast, in fact, that a type-safe countdown will
execute at 5 cycles per iteration on an Intel Core 2:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* The virtual machine's instruction pointer lives in a global register. */
register uint32_t *Iptr __asm__("rbp");

typedef void (*inst_t)(uint64_t types, uint64_t a, uint64_t b);

#define FUNC(x) ((inst_t) (uint64_t) x)
#define INST(x) ((uint32_t) (uint64_t) x)

__attribute__ ((noinline))
void dec_helper(uint64_t types, uint64_t a, uint64_t b)
{
  assert("FIXME" == "");
}

void dec(uint64_t types, uint64_t a, uint64_t b)
{
  if (LIKELY((types & 0xFF) == 1)) {
    uint32_t next = Iptr[1];
    --a;
    ++Iptr;
    FUNC(next)(types, a, b);
  } else
    dec_helper(types, a, b);
}

__attribute__ ((noinline))
void if_not_equal_jump_back_1_helper(uint64_t types, uint64_t a, uint64_t b)
{
  assert("FIXME" == "");
}

void if_not_equal_jump_back_1(uint64_t types, uint64_t a, uint64_t b)
{
  if (LIKELY((types & 0xFFFF) == 0x0101)) {
    if (LIKELY(a != b)) {
      uint32_t next = Iptr[-1];
      --Iptr;
      FUNC(next)(types, a, b);
    } else {
      uint32_t next = Iptr[1];
      ++Iptr;
      FUNC(next)(types, a, b);
    }
  } else
    if_not_equal_jump_back_1_helper(types, a, b);
}

void unconditional_exit(uint64_t types, uint64_t a, uint64_t b)
{
  exit(0);
}

__attribute__ ((noinline, noclone))
void execute(uint32_t *code, uint64_t types, uint64_t a, uint64_t b)
{
  Iptr = code;
  FUNC(code[0])(types, a, b);
}

int main()
{
  uint32_t code[] = {INST(dec), INST(if_not_equal_jump_back_1),
                     INST(unconditional_exit)};
  execute(code + 1, 0x0101, 3000000000, 0);
  return 0;
}

$ gcc-4.5 -O3 -std=gnu99 plain-32bit-direct-dispatch-countdown.c && time ./a.out

real    0m5.007s
user    0m4.996s
sys     0m0.004s

The CPU is 3GHz. Code execution starts at the second instruction
(if_not_equal_jump_back_1). a==3000000000 of type==1 is compared against
b==0 of type==1. The two type checks are performed in parallel in one cycle:
because the two types are packed into the low 16 bits of the types register,
and the low 8, 16 or 32 bits of a 64-bit register can be compared directly,
no separate masking instruction is needed. As a!=b, the code jumps back to
the dec instruction, which performs another type check that a is of type==1
before decrementing a and jumping to if_not_equal_jump_back_1. This continues
until a==0 and the program exits.
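To make the parallel type check concrete, here is a minimal standalone
sketch (not part of the testcase; the pack_types helper and TYPE_FIXNUM tag
are illustrative names of mine) showing how two 8-bit type tags packed into
the low 16 bits of the types word are validated by a single compare:

#include <assert.h>
#include <stdint.h>

#define TYPE_FIXNUM 1  /* illustrative tag; corresponds to type==1 above */

/* Type of `a` goes in bits 0-7, type of `b` in bits 8-15. */
static uint64_t pack_types(uint8_t type_a, uint8_t type_b)
{
  return (uint64_t) type_a | ((uint64_t) type_b << 8);
}

int main(void)
{
  uint64_t types = pack_types(TYPE_FIXNUM, TYPE_FIXNUM);
  /* One 16-bit comparison validates both operand types at once; the
     compiler can emit it as a compare of the register's low word, so no
     separate masking instruction is required. */
  assert((types & 0xFFFF) == 0x0101);
  return 0;
}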
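For reference, the 5-cycle figure follows directly from the timing above:
the countdown performs 3.0e9 iterations of the dec /
if_not_equal_jump_back_1 pair, so

  3 GHz * 5.007 s  ~= 15.0e9 cycles
  15.0e9 cycles / 3.0e9 iterations ~= 5 cycles per iteration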
While the generated assembly of the GCC snapshot speaks for itself, here is
some empirical evidence of its inferiority:

$ gcc-snapshot.sh -O3 -std=gnu99 plain-32bit-direct-dispatch-countdown.c && time ./a.out

real    0m10.014s
user    0m10.009s
sys     0m0.000s

The GCC snapshot has doubled the execution time of this virtual machine
example (compared to gcc-4.3, gcc-4.4 and gcc-4.5).

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281