https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117031

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> (In reply to Tamar Christina from comment #0)
> > GCC seems to miss that there is no gap between the group accesses and that
> > stride == 1.
> > test3 is vectorized linearly by GCC, so it seems this is missed optimization
> > in data ref analysis?
> 
> The load-lanes look fine, so it must be the code generation for the
> HI to DI via SI conversions using unpacks you are complaining about?
> 

No, that one I have a patch for.

> Using load-lanes is natural here.
> 
> This isn't about permutes due to VF or so, isn't it?

It is. The LOAD_LANES is unnecessary because there is no permute in the loop:
the group size is equal to the stride and the offsets are linear.

LOAD_LANES is really expensive, especially the 4-register variants.

My complaint is that this loop does not have a permute.  While it may look
like the entries are permuted, they are not.

Essentially test1 and test3 are the same: the vectorizer picks VF=8, which
effectively unrolls test1 into test3, but it fails to see that the unrolled
accesses are linear.  When the loop is manually unrolled, it does see it:

e.g.

void
test3 (unsigned short *x, double *y, int n)
{
    for (int i = 0; i < n; i+=2)
        {
            unsigned short a1 = x[i * 4 + 0];
            unsigned short b1 = x[i * 4 + 1];
            unsigned short c1 = x[i * 4 + 2];
            unsigned short d1 = x[i * 4 + 3];
            y[i+0] = (double)a1 + (double)b1 + (double)c1 + (double)d1;
            unsigned short a2 = x[(i + 1) * 4 + 0];
            unsigned short b2 = x[(i + 1) * 4 + 1];
            unsigned short c2 = x[(i + 1) * 4 + 2];
            unsigned short d2 = x[(i + 1) * 4 + 3];
            y[i+1] = (double)a2 + (double)b2 + (double)c2 + (double)d2;
        }
}

does not use LOAD_LANES.
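For reference, here is what the rolled test1 presumably looks like (a sketch
reconstructed from the description above, since test3 is just test1 unrolled
by a factor of 2; the exact testcase from comment #0 may differ slightly):

```c
/* Rolled form: one group of 4 consecutive unsigned short loads per
   iteration, group size == stride == 4, offsets linear in i.  */
void
test1 (unsigned short *x, double *y, int n)
{
    for (int i = 0; i < n; i++)
        {
            unsigned short a = x[i * 4 + 0];
            unsigned short b = x[i * 4 + 1];
            unsigned short c = x[i * 4 + 2];
            unsigned short d = x[i * 4 + 3];
            y[i] = (double)a + (double)b + (double)c + (double)d;
        }
}
```

Since every lane of the group is consumed by the same reduction into y[i],
no cross-lane shuffle is ever needed; contiguous vector loads would suffice.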
