------- Comment #7 from michaelni at gmx dot at  2008-03-22 02:51 -------
You can also replace the inner loop with:

        "2:                         \n\t"
        "pxor %%mm1, %%mm1          \n\t"
        "movq  (%%eax, %%ecx), %%mm0\n\t"
        "psubw (%%esi, %%ecx), %%mm0\n\t"
        "pcmpgtw %%mm0, %%mm1       \n\t"
        "por     %%mm6, %%mm1       \n\t"
        "pmaddwd %%mm1, %%mm0       \n\t"
        "paddd %%mm0, %%mm7         \n\t"
        "addl $8, %%ecx             \n\t"
        " jnz 2b                    \n\t"

This has one instruction fewer. It's a hair faster on my P3 but a little slower
on my Duron.
And of course the most obvious optimization is to unroll this and do a bunch of
iterations at once.
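
For anyone reading the loop cold, here is a rough scalar C model of what each 16-bit lane appears to compute. It assumes that every 16-bit word of %mm6 holds 1, which the comment above does not state, and the function names are made up for illustration. Under that assumption the pcmpgtw/por pair builds a -1/+1 multiplier, so pmaddwd yields the absolute differences, summed in adjacent pairs.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one 16-bit lane of the MMX loop.
 * ASSUMPTION: each 16-bit word of %mm6 holds 1 (not stated in the source).
 * lane_abs_diff and sad16_model are hypothetical names. */
static int32_t lane_abs_diff(int16_t a, int16_t b)
{
    /* psubw: 16-bit subtraction that wraps modulo 2^16.
     * The narrowing cast relies on GCC's modular conversion. */
    int16_t d = (int16_t)(uint16_t)((uint16_t)a - (uint16_t)b);

    /* pxor + pcmpgtw: all ones (-1) where 0 > d, else 0 */
    int16_t sign = (int16_t)((d < 0) ? -1 : 0);

    /* por with the assumed constant 1: sign becomes -1 or +1 */
    sign = (int16_t)(sign | 1);

    /* pmaddwd: signed 16x16 -> 32-bit product, i.e. |d| */
    return (int32_t)d * sign;
}

/* Accumulates over n words. In the MMX loop, pmaddwd adds adjacent
 * pairs and paddd keeps two 32-bit partial sums in %mm7; this model
 * folds them into a single total. */
static int32_t sad16_model(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += lane_abs_diff(a[i], b[i]);
    return sum;
}
```

This is only meant as a reference to check an unrolled or otherwise modified version against, not as a replacement for the asm.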


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395
