http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182
--- Comment #24 from davidxl <xinliangli at gmail dot com> 2011-10-24 23:00:22 UTC ---
(In reply to comment #23)
> Here is the source preprocessed for gcc47. The test exhibits the
> slowdown mentioned in comment 11.

The problem can be reproduced with a simplified test case. Depending on how the
result value from the inner loop is used in the outer loop (related to casting),
the inner loop code is quite different: in the slow case, two redundant sign
extensions and a move instruction are generated.

# the fast version
gcc -O3 -DFAST_VER bug.cpp
./a.out
rv=4282167296

test description              absolute  operations  ratio with
number                        time      per second  test0

 0 "int8_t constant add"      1.05 sec  1523.81 M   1.00

Total absolute time for int8_t constant folding: 1.05 sec

# the slow version:
gcc -O3 bug.cpp
./a.out
rv=4282167296

test description              absolute  operations  ratio with
number                        time      per second  test0

 0 "int8_t constant add"      1.57 sec  1019.11 M   1.00

Total absolute time for int8_t constant folding: 1.57 sec

# however, when disabling inlining of check_shifted_sum_1 in the slow case,
# the runtime is recovered:
gcc -O3 -DNOINLINE bug.cpp
./a.out
rv=4282167296

test description              absolute  operations  ratio with
number                        time      per second  test0

 0 "int8_t constant add"      1.05 sec  1523.81 M   1.00

Total absolute time for int8_t constant folding: 1.05 sec

The inner loop body in the faster case:

.L60:
    movzbl  0(%rbp,%rcx), %r9d
    addq    $1, %rcx
    cmpl    %ecx, %ebx
    leal    10(%r8,%r9), %r8d
    # SUCC: 4 [91.0%] (dfs_back,can_fallthru) 5 [9.0%] (fallthru,can_fallthru,loop_exit)
    jg      .L60

while for the slow case:

.L60:
    movzbl  (%r12,%rcx), %eax
    movsbl  %r8b, %r8d
    addq    $1, %rcx
    leal    10(%rax), %r9d
    movsbl  %r9b, %r9d
    addl    %r8d, %r9d
    cmpl    %ecx, %ebp
    movl    %r9d, %r8d
    # SUCC: 4 [91.0%] (dfs_back,can_fallthru) 5 [9.0%] (fallthru,can_fallthru,loop_exit)
    jg      .L60

The relevant source change:

#ifdef NOINLINE
#define INL __attribute__((noinline))
#else
#define INL inline
#endif

template <typename T, typename T2, typename Shifter>
INL void check_shifted_sum_1(T2 result) {
    T temp = (T)SIZE * Shifter::do_shift((T)init_value);
    if (!tolerance_equal<T>((T&)result,temp))
        printf("test %i failed\n", current_test);
}

#ifdef FAST_VER
#define TYPE u_int32_t
#else
#define TYPE int8_t
#endif

template <typename T, typename Shifter>
__attribute__((noinline))
u_int32_t test_constant(T* first, int count, const char *label) {
    int i;
    u_int32_t rv = 0;

    start_timer();

    for (i = 0; i < iterations; ++i) {
        T result = 0;
        for (int n = 0; n < count; ++n) {
            result += Shifter::do_shift( first[n] );
        }
        rv += result;
        check_shifted_sum_1<T, TYPE, Shifter>(result);
    }

    record_result( timer(), label );
    return rv;
}
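
To illustrate why the two movsbl instructions in the slow loop can be considered redundant, here is a minimal standalone sketch. It is not part of bug.cpp: the names add_const, sum_narrow_each_step, sum_narrow_once and check_equivalence are hypothetical, and add_const is only an assumed stand-in for Shifter::do_shift (adding the constant 10, as the leal in the assembly suggests). It also assumes that converting an out-of-range value to int8_t wraps modulo 256, which GCC documents and C++20 guarantees. Under those assumptions, narrowing after every addition and narrowing once after the loop give the same int8_t result, because only the low 8 bits of the accumulator matter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for Shifter::do_shift: add the constant 10.
inline int add_const(int x) { return x + 10; }

// Narrow to int8_t after every step, mirroring the per-iteration
// sign extensions (movsbl) seen in the slow loop.
std::int8_t sum_narrow_each_step(const std::int8_t* first, std::size_t count) {
    std::int8_t result = 0;
    for (std::size_t n = 0; n < count; ++n) {
        std::int8_t term = static_cast<std::int8_t>(add_const(first[n]));
        result = static_cast<std::int8_t>(result + term);
    }
    return result;
}

// Accumulate in a 32-bit unsigned register and narrow once at the end,
// mirroring the single leal in the fast loop.
std::int8_t sum_narrow_once(const std::int8_t* first, std::size_t count) {
    std::uint32_t acc = 0;
    for (std::size_t n = 0; n < count; ++n) {
        acc += static_cast<std::uint32_t>(add_const(first[n]));
    }
    return static_cast<std::int8_t>(acc);
}

// Asserts that both strategies agree on the given input.
void check_equivalence(const std::int8_t* first, std::size_t count) {
    assert(sum_narrow_each_step(first, count) == sum_narrow_once(first, count));
}
```

Calling check_equivalence on an array whose running sum overflows int8_t several times exercises the wraparound path. The sketch only shows that the two forms are arithmetically interchangeable; it does not by itself reproduce the timing difference reported above.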