https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117031

            Bug ID: 117031
           Summary: increasing VF during SLP vectorization permutes
                    unnecessarily
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following testcase:

---
void
test1 (unsigned short *x, double *y, int n)
{
    for (int i = 0; i < n; i++)
        {
            unsigned short a = x[i * 4 + 0];
            unsigned short b = x[i * 4 + 1];
            unsigned short c = x[i * 4 + 2];
            unsigned short d = x[i * 4 + 3];
            y[i] = (double)a + (double)b + (double)c + (double)d;
        }
}
---

at -O3 vectorizes using LOAD_LANES on aarch64:

  vect_array.11 = .LOAD_LANES (MEM <short unsigned int[32]> [(short unsigned int *)vectp_x.9_123]);
  vect_a_29.12_125 = vect_array.11[0];
  vect__14.17_129 = [vec_unpack_lo_expr] vect_a_29.12_125;
  vect__14.17_130 = [vec_unpack_hi_expr] vect_a_29.12_125;
  vect__14.16_131 = [vec_unpack_lo_expr] vect__14.17_129;
  vect__14.16_132 = [vec_unpack_hi_expr] vect__14.17_129;
  vect__14.16_133 = [vec_unpack_lo_expr] vect__14.17_130;
  vect__14.16_134 = [vec_unpack_hi_expr] vect__14.17_130;
  vect__14.18_135 = (vector(2) double) vect__14.16_131;
  vect__14.18_136 = (vector(2) double) vect__14.16_132;
  vect__14.18_137 = (vector(2) double) vect__14.16_133;
  vect__14.18_138 = (vector(2) double) vect__14.16_134;
...


This is because the input is a group of 4 shorts, so V4HI is the natural size.
V4HI fails to vectorize because we don't support direct conversion from V4HI
to V4SI.

We then pick a higher VF (V8HI) and the loads are detected as interleaving.
LLVM however avoids the permute here by detecting that the unrolling doesn't
result in a permuted access, as it's equivalent to:

void
test3 (unsigned short *x, double *y, int n)
{
    for (int i = 0; i < n; i+=2)
        {
            unsigned short a1 = x[i * 4 + 0];
            unsigned short b1 = x[i * 4 + 1];
            unsigned short c1 = x[i * 4 + 2];
            unsigned short d1 = x[i * 4 + 3];
            y[i+0] = (double)a1 + (double)b1 + (double)c1 + (double)d1;
            unsigned short a2 = x[i * 4 + 4];
            unsigned short b2 = x[i * 4 + 5];
            unsigned short c2 = x[i * 4 + 6];
            unsigned short d2 = x[i * 4 + 7];
            y[i+1] = (double)a2 + (double)b2 + (double)c2 + (double)d2;
        }
}

GCC seems to miss that there is no gap between the group accesses and that
stride == 1. test3 is vectorized linearly by GCC, so it seems this is a missed
optimization in data ref analysis?
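
For reference, a minimal sketch to sanity-check the two claims above: that
test1 and test3 compute the same thing, and that the group accesses of test1
leave no gap. test1 and test3 are copied from the testcases; same_result,
accesses_are_contiguous and MAX_N are hypothetical names introduced only for
this illustration. Note the equivalence is only checked for even n, since
test3 as written would touch one extra group when n is odd.

```c
/* test1 and test3 are copied from the report.  The two helpers at the
   bottom are hypothetical and exist only for this illustration.  */

void
test1 (unsigned short *x, double *y, int n)
{
    for (int i = 0; i < n; i++)
        {
            unsigned short a = x[i * 4 + 0];
            unsigned short b = x[i * 4 + 1];
            unsigned short c = x[i * 4 + 2];
            unsigned short d = x[i * 4 + 3];
            y[i] = (double)a + (double)b + (double)c + (double)d;
        }
}

void
test3 (unsigned short *x, double *y, int n)
{
    for (int i = 0; i < n; i+=2)
        {
            unsigned short a1 = x[i * 4 + 0];
            unsigned short b1 = x[i * 4 + 1];
            unsigned short c1 = x[i * 4 + 2];
            unsigned short d1 = x[i * 4 + 3];
            y[i+0] = (double)a1 + (double)b1 + (double)c1 + (double)d1;
            unsigned short a2 = x[i * 4 + 4];
            unsigned short b2 = x[i * 4 + 5];
            unsigned short c2 = x[i * 4 + 6];
            unsigned short d2 = x[i * 4 + 7];
            y[i+1] = (double)a2 + (double)b2 + (double)c2 + (double)d2;
        }
}

enum { MAX_N = 64 };

/* Returns 1 if test1 and test3 store identical values for the first n
   outputs.  Only even n in [0, MAX_N] is accepted (assumption of this
   sketch); anything else returns 0.  */
int
same_result (int n)
{
    unsigned short x[MAX_N * 4];
    double y1[MAX_N], y3[MAX_N];

    if (n < 0 || n > MAX_N || n % 2 != 0)
        return 0;

    /* Arbitrary but deterministic input data.  */
    for (int i = 0; i < MAX_N * 4; i++)
        x[i] = (unsigned short) (i * 7919u + 13u);

    test1 (x, y1, n);
    test3 (x, y3, n);

    /* Sums of four 16-bit values are exact in double, so == is safe.  */
    for (int i = 0; i < n; i++)
        if (y1[i] != y3[i])
            return 0;
    return 1;
}

/* Returns 1 if the element indices read by test1, taken in program
   order, are exactly 0, 1, 2, ..., 4*n - 1, i.e. every group starts
   right where the previous one ended.  */
int
accesses_are_contiguous (int n)
{
    long expected = 0;

    for (int i = 0; i < n; i++)
        for (int lane = 0; lane < 4; lane++)
            if ((long) i * 4 + lane != expected++)
                return 0;
    return 1;
}
```

Both helpers return 1 for even n up to MAX_N, which matches the observation
that unrolling test1 by two gives a plain linear access rather than an
interleaved one.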
