https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82147

            Bug ID: 82147
           Summary: Autovectorization for extraction is slower than done
                    manually
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Take:
void f(float *restrict a, float * restrict b, float * restrict c)
{
  for(int i = 0; i< 1024;i++)
    {
      a[i] = c[i*2];
      b[i] = c[i*2 + 1];
    }
}

#define vector8 __attribute__((vector_size(8)))

void f1(float *restrict a, float * restrict b, float * restrict c)
{
  for(int i = 0; i< 1024;i++)
    {
      vector8 float d = *(vector8 float *)&c[i*2];
      a[i] = d[0];
      b[i] = d[1];
    }
}
--- CUT ---
I would have expected f and f1 to produce the same code, but f uses ld2
followed by two quad stores, while f1 uses ldr(d) followed by str(s) and
st1(s).  As far as I can tell, on most processors ld2/str(q)/str(q) is going
to be slower than ldr/str/st1.
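
Before comparing the generated code, it can help to confirm that the two functions really compute the same thing. The following is only a minimal sketch of such a check, not part of the original test case: `N` and `outputs_match` are names introduced here for illustration, and f1 loads the pair with memcpy instead of the pointer cast used above, to avoid relying on alignment or aliasing assumptions.

```c
#include <string.h>

#define N 1024
#define vector8 __attribute__((vector_size(8)))

/* Scalar loop from the report, left to the auto-vectorizer. */
void f(float *restrict a, float *restrict b, float *restrict c)
{
  for (int i = 0; i < N; i++)
    {
      a[i] = c[i*2];
      b[i] = c[i*2 + 1];
    }
}

/* Manual variant.  Assumption of this sketch: memcpy replaces the
   pointer cast from the report when loading the two-float vector.  */
void f1(float *restrict a, float *restrict b, float *restrict c)
{
  for (int i = 0; i < N; i++)
    {
      vector8 float d;
      memcpy(&d, &c[i*2], sizeof d);
      a[i] = d[0];
      b[i] = d[1];
    }
}

/* Hypothetical helper: returns 1 when f and f1 write identical
   a and b arrays for the same interleaved input, 0 otherwise.  */
int outputs_match(void)
{
  static float c[2*N], a0[N], b0[N], a1[N], b1[N];

  for (int i = 0; i < 2*N; i++)
    c[i] = (float)i;

  f(a0, b0, c);
  f1(a1, b1, c);

  return memcmp(a0, a1, sizeof a0) == 0
	 && memcmp(b0, b1, sizeof b0) == 0;
}
```

Calling outputs_match() from a small main and building at -O3 should leave any remaining difference between f and f1 as purely a code generation matter, which is what this report is about.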

I noticed this after the last talk about auto-vectorization.
