------- Comment #5 from ubizjak at gmail dot com 2007-07-12 08:22 -------
(In reply to comment #0)
> The loop for CallSumDeltas2 compiles to:
>
> .L7:
>         movdqa  %xmm1, %xmm0
>         pslldq  $4, %xmm0
>         addl    $1, %eax
>         paddd   %xmm1, %xmm0
>         cmpl    $100000000, %eax
>         movdqa  %xmm0, %xmm1
>         pslldq  $8, %xmm1
>         paddd   %xmm1, %xmm0
>         movdqa  %xmm0, %xmm1
>         movdqa  %xmm0, foo1
>         jne     .L7
>
> ===
>
> This is two more movdqa than the hand-written code in CallSumDeltas3.

        paddd   %xmm1, %xmm0    (2)
        movdqa  %xmm0, %xmm1    (2)
        movdqa  %xmm0, foo1     (1)
        jne     .L7

(1) is an assignment to a global variable. I'm not sure that it can be pushed
out of the loop, but this can be solved by adding a local temporary in
CallSumDeltas2().

(2) is probably regmove, failing to optimize:

(set (reg:V4SI 21 xmm0 [72])
     (plus:V4SI (reg:V4SI 21 xmm0 [69])
                (reg:V4SI 22 xmm1 [71]))) 843 {*addv4si3} (nil))

(set (reg:V2DI 22 xmm1 [orig:73 foo1 ] [73])
     (reg:V2DI 21 xmm0 [72])) 698 {*movv2di_internal} (nil))

into

(set (reg:V4SI 21 xmm1 [72])
     (plus:V4SI (reg:V4SI 21 xmm1 [69])
                (reg:V4SI 22 xmm0 [71]))) 843 {*addv4si3} (nil))

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32735
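
For illustration, the local-temporary workaround suggested for (1) could look roughly like the sketch below. The body of CallSumDeltas2() is not shown in this comment, so everything here is an assumption: sum_deltas() and call_sum_deltas_local() are hypothetical names, the shift/add sequence is inferred from the pslldq/paddd pairs in the quoted loop, and foo1 is assumed to be a global __m128i. The idea is only that the accumulator lives in a local, so the store to foo1 can happen once after the loop.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Assumed stand-in for the global variable named in the report. */
__m128i foo1;

/* Hypothetical helper: running sum of four 32-bit lanes, matching the
   pslldq $4 / paddd and pslldq $8 / paddd pairs in the quoted loop. */
static inline __m128i sum_deltas(__m128i x)
{
    x = _mm_add_epi32(x, _mm_slli_si128(x, 4));  /* x += x shifted by one lane  */
    x = _mm_add_epi32(x, _mm_slli_si128(x, 8));  /* x += x shifted by two lanes */
    return x;
}

/* Hypothetical variant of the loop that keeps the value in a local
   temporary and writes the global only once, after the loop. */
void call_sum_deltas_local(int iterations)
{
    __m128i acc = foo1;              /* read the global once */
    for (int i = 0; i < iterations; i++)
        acc = sum_deltas(acc);
    foo1 = acc;                      /* single store after the loop */
}
```

Whether this also removes the extra register copy described in (2) depends on the compiler version and options; the sketch addresses only the store to the global.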