http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #7 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 10:11:51 UTC ---

Btw, when I run the benchmark with the addition of -march=native (for me,
that's -march=corei7) then GCC 4.7 performs better than 4.6:

4.6:

./t 100000

test                        description   absolute   operations   ratio with
number                                    time       per second   test0

 0 "int32_t for loop unroll 1"  0.41 sec   1951.22 M     1.00
 1 "int32_t for loop unroll 2"  0.51 sec   1568.63 M     1.24
 2 "int32_t for loop unroll 3"  0.47 sec   1702.13 M     1.15
 3 "int32_t for loop unroll 4"  0.48 sec   1666.67 M     1.17
 4 "int32_t for loop unroll 5"  0.47 sec   1702.13 M     1.15
 5 "int32_t for loop unroll 6"  0.51 sec   1568.63 M     1.24
 6 "int32_t for loop unroll 7"  0.47 sec   1702.13 M     1.15
 7 "int32_t for loop unroll 8"  0.47 sec   1702.13 M     1.15

Total absolute time for int32_t for loop unrolling: 3.79 sec

4.7:

./t 100000

test                        description   absolute   operations   ratio with
number                                    time       per second   test0

 0 "int32_t for loop unroll 1"  0.39 sec   2051.28 M     1.00
 1 "int32_t for loop unroll 2"  0.40 sec   2000.00 M     1.03
 2 "int32_t for loop unroll 3"  0.39 sec   2051.28 M     1.00
 3 "int32_t for loop unroll 4"  0.39 sec   2051.28 M     1.00
 4 "int32_t for loop unroll 5"  0.38 sec   2105.26 M     0.97
 5 "int32_t for loop unroll 6"  0.41 sec   1951.22 M     1.05
 6 "int32_t for loop unroll 7"  0.37 sec   2162.16 M     0.95
 7 "int32_t for loop unroll 8"  0.36 sec   2222.22 M     0.92

Total absolute time for int32_t for loop unrolling: 3.09 sec

The loop then looks like (the expected)

.L53:
        movdqa  (%rax), %xmm4
        paddd   %xmm3, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm2, %xmm4
        paddd   %xmm4, %xmm6
        movdqa  16(%rax), %xmm4
        addq    $32, %rax
        cmpq    $data32+32000, %rax
        paddd   %xmm3, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm2, %xmm4
        paddd   %xmm4, %xmm5
        jne     .L53

looks like pmulld is only available with SSE 4.1 and otherwise we fall back
to the define_insn_and_split "*sse2_mulv4si3".
But that complexity is not reflected in the vectorizer cost model (which needs improvement ...).
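
To illustrate where that extra complexity comes from, here is a rough scalar model in C of how a four-lane 32-bit multiply can be built when only a widening even-lane multiply (pmuludq-style) is available. This is a sketch under assumptions, not the actual expansion of "*sse2_mulv4si3": the function names are made up for illustration, and the exact shuffle sequence GCC emits is not shown in this comment. The point is only that one pmulld turns into two widening multiplies plus data movement to move the odd lanes into place and to interleave the results.

```c
#include <stdint.h>

/* Hypothetical model of one pmuludq-style instruction: it multiplies
   only lanes 0 and 2 of each operand as unsigned 32-bit values and
   yields two full 64-bit products. */
static void widening_mul_even(const uint32_t a[4], const uint32_t b[4],
                              uint64_t out[2])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[2] * b[2];
}

/* Lane-wise 32-bit multiply (what a single pmulld does) emulated with
   two widening multiplies.  The low 32 bits of a product are the same
   for signed and unsigned operands, so this also covers int32_t data. */
static void mul_v4si_model(const uint32_t a[4], const uint32_t b[4],
                           uint32_t r[4])
{
    /* Models shifting each operand so the odd lanes land in the even
       positions; the real code needs shift/shuffle instructions here. */
    uint32_t a_odd[4] = { a[1], 0, a[3], 0 };
    uint32_t b_odd[4] = { b[1], 0, b[3], 0 };
    uint64_t even[2], odd[2];

    widening_mul_even(a, b, even);         /* products of lanes 0 and 2 */
    widening_mul_even(a_odd, b_odd, odd);  /* products of lanes 1 and 3 */

    /* Keep the low 32 bits of each product and interleave them back
       into lane order; again extra shuffles in the real sequence. */
    r[0] = (uint32_t)even[0];
    r[1] = (uint32_t)odd[0];
    r[2] = (uint32_t)even[1];
    r[3] = (uint32_t)odd[1];
}
```

Calling mul_v4si_model on two four-element arrays gives the same result as multiplying the elements pairwise modulo 2^32, but at the cost of several operations per vector multiply instead of one, which is the kind of cost a per-statement vectorizer estimate would have to account for.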