------- Comment #9 from jakub at gcc dot gnu dot org  2007-11-22 16:41 -------
On x86_64-linux -m64 with -O2, gcc doesn't hoist the movabsq insns out of the
loops. Hoisting them gives some performance back:
time ./pr23305-slow
real    0m4.028s
user    0m4.023s
sys     0m0.003s
time ./pr23305-slow2
real    0m3.436s
user    0m3.434s
sys     0m0.001s

when I hoist it by hand in assembly:

--- pr23305-slow.s      2007-11-22 17:14:09.000000000 +0100
+++ pr23305-slow2.s     2007-11-22 17:31:31.000000000 +0100
@@ -222,16 +222,16 @@ _Z13s000005a_testv:
 .LVL2:
 .LBB329:
 .LBB330:
        .loc 1 28697 0
        cmpq    %rax, %rdx
        je      .L13
+       movabsq $4613937818241073152, %r8
        .p2align 4,,10
        .p2align 3
 .L14:
-       movabsq $4613937818241073152, %r8
        movq    %r8, (%rax)
        addq    $8, %rax
        cmpq    %rax, %rdx
        jne     .L14
 .L13:
 .LBE330:
@@ -242,17 +242,17 @@ _Z13s000005a_testv:
 .LVL3:
 .LBB326:
 .LBB327:
        .loc 1 28697 0
        cmpq    %rax, %rdx
        je      .L15
+       movabsq $4613937818241073152, %rdi
        .p2align 4,,10
        .p2align 3
 .L16:
 .LBE327:
-       movabsq $4613937818241073152, %rdi
        movq    %rdi, (%rax)
 .LBB328:
        addq    $8, %rax
        cmpq    %rax, %rdx
        jne     .L16
 .L15:

but still the -O2 -fno-inline-small-functions version is much faster:

time ./pr23305-fast
real    0m1.591s
user    0m1.588s
sys     0m0.001s


-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23305
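
For reference, the immediate in the movabsq insns, 4613937818241073152, is
0x4008000000000000, i.e. the IEEE 754 bit pattern of the double 3.0, so each
loop in the diff reads like a plain fill of a range of doubles with 3.0. A
minimal C++ sketch of that reading, not the actual testcase source:
bits_to_double and fill_range are hypothetical names, and the guard plus
do/while shape only mirrors the cmpq/je check and the bottom-tested loop seen
in the assembly.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Assumption: double is a 64-bit IEEE 754 type, as on x86_64-linux.
static_assert(sizeof(double) == sizeof(std::uint64_t), "expects 64-bit double");
static_assert(std::numeric_limits<double>::is_iec559, "expects IEEE 754 double");

// Hypothetical helper: reinterpret a 64-bit pattern as a double.
static double bits_to_double(std::uint64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Hypothetical stand-in for one of the loops in the diff: store the same
// 8-byte value into every slot of [first, last).
static void fill_range(double *first, double *last, double value) {
    if (first == last)          // cmpq %rax, %rdx ; je .L13
        return;
    do {
        *first = value;         // movq %r8, (%rax)
        ++first;                // addq $8, %rax
    } while (first != last);    // cmpq %rax, %rdx ; jne .L14
}
```

Since value is loop-invariant, the constant only needs to be materialized
once before the loop, which is what the hand-edited pr23305-slow2.s does.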