http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56717

--- Comment #2 from Cong Hou <congh at google dot com> ---
I examined the GCC-generated code and found that the main problem is the load
of 'scale' (the rhs operand of >>) into an xmm register: it sits inside the
loop body, although it could be moved outside.

This happens during the rtl-reload pass. For the following code, the load of
scale remains outside the loop body:


void foo(short* a, short scale, int n) {
  int i;
  for (i=0; i<n; i++)
    a[i] = a[i] >> scale;
}


But for your code, the load is not kept outside the loop. I suspect there is
an issue in that pass.
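
For illustration only, here is a minimal hand-written SSE2 sketch of what
"load of scale outside the loop" means at the source level. It is not GCC's
actual output; the name foo_hoisted and the 8-element chunking are assumptions
made for the example, and it assumes an x86 target with SSE2 and a
non-negative scale.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical hand-vectorized variant of foo(); not what GCC emits. */
void foo_hoisted(short *a, short scale, int n) {
  /* The shift count is moved into an xmm register once, before the loop. */
  __m128i count = _mm_cvtsi32_si128(scale);
  int i = 0;
  for (; i + 8 <= n; i += 8) {
    __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
    v = _mm_sra_epi16(v, count);  /* PSRAW with the count taken from xmm */
    _mm_storeu_si128((__m128i *)(a + i), v);
  }
  for (; i < n; i++)  /* scalar tail for the remaining elements */
    a[i] = a[i] >> scale;
}
```

In the problematic case, the equivalent of the _mm_cvtsi32_si128 step ends up
inside the loop and is repeated on every iteration.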

By the way, my test shows that using PMADDWD is no faster than the code GCC
generates now.
