------- Comment #7 from michaelni at gmx dot at 2008-03-22 02:51 -------
You can also replace the inner loop by:
"2: \n\t"
"pxor %%mm1, %%mm1 \n\t"
"movq (%%eax, %%ecx), %%mm0\n\t"
"psubw (%%esi, %%ecx), %%mm0\n\t"
"pcmpgtw %%mm0, %%mm1 \n\t"
"por %%mm6, %%mm1 \n\t"
"pmaddwd %%mm1, %%mm0 \n\t"
"paddd %%mm0, %%mm7 \n\t"
"addl $8, %%ecx \n\t"
" jnz 2b \n\t"
This has one instruction fewer. It's a hair faster on my P3 but a little slower
on my Duron.
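For anyone reading along, here is a rough scalar C model of what one iteration
of that loop appears to compute. It is only a sketch: it assumes mm6 holds
0x0001 in every 16-bit word (the setup code is not shown in this comment), the
function name abs_diff_sum4 is made up for illustration, and it folds the two
dword lanes of mm7 into a single sum.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one 64-bit MMX iteration (four 16-bit lanes).
   ASSUMPTION: mm6 holds 0x0001 in every word; that setup is not shown here. */
static int32_t abs_diff_sum4(const int16_t *a, const int16_t *b)
{
    int32_t sum = 0;
    for (size_t i = 0; i < 4; i++) {
        int16_t d = (int16_t)(a[i] - b[i]);          /* psubw: 16-bit subtract */
        int16_t sign = (int16_t)((0 > d) ? -1 : 0);  /* pcmpgtw: all ones where d < 0 */
        sign = (int16_t)(sign | 1);                  /* por with mm6: gives -1 or +1 */
        sum += (int32_t)d * sign;                    /* pmaddwd: d * sign, i.e. |d| */
    }
    return sum;                                      /* paddd accumulates into mm7 */
}
```

The point of the trick is that multiplying by -1 or +1 takes the absolute
value, and pmaddwd also widens and pairs up the results in the same
instruction.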
And of course the most obvious optimization is to unroll this and do a bunch of
them at once.
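One possible reading of the unrolling idea, again as a scalar C sketch rather
than tuned assembly: handle two groups of four words per loop pass and keep two
independent accumulators, which would map to two MMX registers. The names
abs_by_sign and abs_diff_sum_unrolled, the unroll factor of two, and the
requirement that n be a multiple of 8 are all assumptions made for this
example.

```c
#include <stddef.h>
#include <stdint.h>

/* |d| via the sign trick: multiply d by -1 if it is negative, else by +1. */
static int32_t abs_by_sign(int16_t d)
{
    int16_t sign = (int16_t)((d < 0 ? -1 : 0) | 1);
    return (int32_t)d * sign;
}

/* Hypothetical 2x unrolled loop. ASSUMPTION: n is a multiple of 8 words. */
static int32_t abs_diff_sum_unrolled(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc0 = 0, acc1 = 0;  /* two independent accumulators */
    for (size_t i = 0; i < n; i += 8) {
        for (size_t j = 0; j < 4; j++) {
            /* first group of four words, as one movq/psubw/... sequence */
            acc0 += abs_by_sign((int16_t)(a[i + j] - b[i + j]));
            /* second group of four words, independent of the first */
            acc1 += abs_by_sign((int16_t)(a[i + 4 + j] - b[i + 4 + j]));
        }
    }
    return acc0 + acc1;          /* combine the accumulators once at the end */
}
```

Whether this actually helps depends on the CPU, so it would need to be timed
the same way as the loop above.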
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395