------- Comment #7 from michaelni at gmx dot at 2008-03-22 02:51 -------
You can also replace the inner loop by:
"2: \n\t"
"pxor %%mm1, %%mm1 \n\t"
"movq (%%eax, %%ecx), %%mm0\n\t"
"psubw (%%esi, %%ecx), %%mm0\n\t"
"pcmpgtw %%mm0, %%mm1 \n\t"
"por %%mm6, %%mm1 \n\t"
"pmaddwd %%mm1, %%mm0 \n\t"
"paddd %%mm0, %%mm7 \n\t"
"addl $8, %%ecx \n\t"
" jnz 2b \n\t"

This has one instruction fewer. It's a hair faster on my P3 but a little slower on my Duron. And of course the most obvious optimization is to unroll this and do a bunch of them at once.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395
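
For readers who want to check what the loop above computes, here is a rough scalar model in C. It is only a sketch under assumptions the comment does not state: that %%mm6 holds 0x0001 in each 16-bit word, that %%mm7 is the running total, and that the two 32-bit halves of %%mm7 are added together after the loop. The function name sad16_sign_trick is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the MMX loop: sum of |a[i] - b[i]| over 16-bit words.
 * Hypothetical helper, not taken from the bug report.
 * Assumes each difference fits in 16 bits (psubw wraps otherwise). */
static int32_t sad16_sign_trick(const int16_t *a, const int16_t *b, size_t n)
{
    /* One movq covers four 16-bit words, so model whole groups only. */
    assert(n % 4 == 0);

    int32_t sum = 0;                                  /* plays the role of mm7 */
    for (size_t i = 0; i < n; i++) {
        int16_t diff = (int16_t)(a[i] - b[i]);        /* psubw */
        int16_t mask = (int16_t)((0 > diff) ? -1 : 0);/* pcmpgtw against zero */
        int16_t sign = (int16_t)(mask | 1);           /* por with assumed 0x0001: -1 or +1 */
        sum += (int32_t)diff * sign;                  /* pmaddwd, then paddd */
    }
    return sum;
}
```

The point of the trick is that multiplying each difference by -1 or +1 gives its absolute value, and pmaddwd does that multiply and the pairwise add in one instruction.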