https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68494
--- Comment #2 from Michael Collison <michael.collison at linaro dot org> ---
Sorry, here is the updated test case.
#define NTAPS 4
short taps[NTAPS];
void fir_t5(int len, short * __restrict p, short *__restrict x,
            short *__restrict taps)
{
  len = len & ~31;
  for (int i = 0; i < len; i++)
    {
      int tmp = 0;
      for (int j = 0; j < NTAPS; j++)
        {
          tmp += x[i - j] * taps[j];
        }
      p[i] = tmp;
    }
}
--------------------------------------------------------------------------------
We currently generate a vdup of the scalar taps[j] in the inner loop. Ideally
we would not emit the vdup and would instead use a vmul that takes the scalar
operand from a vector lane.
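
To illustrate the difference, here is a rough portable C model of the two code
shapes. It is only a sketch, not the code GCC emits: the types vec_s16 and
vec_s32 and the helpers load_s16, dup_s16, mla_vec and mla_lane are
hypothetical stand-ins for vector registers and instructions, and it assumes
NTAPS equals the lane count so that all taps fit in one vector. As in the test
case, x must have NTAPS - 1 valid elements before x[0].

```c
#define NTAPS 4
#define LANES 4   /* assumption: one vector holds all NTAPS taps */

/* Hypothetical scalar models of 4-lane vectors; these are not NEON types. */
typedef struct { short v[LANES]; } vec_s16;
typedef struct { int v[LANES]; } vec_s32;

/* Load LANES consecutive shorts (models a vector load). */
static vec_s16 load_s16(const short *p)
{
    vec_s16 r;
    for (int k = 0; k < LANES; k++)
        r.v[k] = p[k];
    return r;
}

/* Broadcast one scalar into every lane (models the vdup). */
static vec_s16 dup_s16(short s)
{
    vec_s16 r;
    for (int k = 0; k < LANES; k++)
        r.v[k] = s;
    return r;
}

/* acc += a * b, lane by lane (models a vector-by-vector multiply-accumulate). */
static vec_s32 mla_vec(vec_s32 acc, vec_s16 a, vec_s16 b)
{
    for (int k = 0; k < LANES; k++)
        acc.v[k] += a.v[k] * b.v[k];
    return acc;
}

/* acc += a * b[lane] (models a multiply-accumulate by lane). */
static vec_s32 mla_lane(vec_s32 acc, vec_s16 a, vec_s16 b, int lane)
{
    for (int k = 0; k < LANES; k++)
        acc.v[k] += a.v[k] * b.v[lane];
    return acc;
}

/* Current shape: one broadcast of taps[j] per inner-loop iteration. */
void fir_dup(int len, short *p, const short *x, const short *taps)
{
    for (int i = 0; i + LANES <= len; i += LANES) {
        vec_s32 acc = {{0, 0, 0, 0}};
        for (int j = 0; j < NTAPS; j++)
            acc = mla_vec(acc, load_s16(&x[i - j]), dup_s16(taps[j]));
        for (int k = 0; k < LANES; k++)
            p[i + k] = (short)acc.v[k];
    }
}

/* Desired shape: taps loaded once, each tap then selected by lane index. */
void fir_lane(int len, short *p, const short *x, const short *taps)
{
    vec_s16 t = load_s16(taps);   /* hoisted out of both loops */
    for (int i = 0; i + LANES <= len; i += LANES) {
        vec_s32 acc = {{0, 0, 0, 0}};
        for (int j = 0; j < NTAPS; j++)
            acc = mla_lane(acc, load_s16(&x[i - j]), t, j);
        for (int k = 0; k < LANES; k++)
            p[i + k] = (short)acc.v[k];
    }
}
```

Both functions produce the same outputs. The second form drops the per-tap
broadcast from the inner loop, which is the change this report asks for.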