https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82147
Bug ID: 82147
Summary: Autovectorization for extraction is slower than done manually
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Target: aarch64
Take:
void f(float *restrict a, float *restrict b, float *restrict c)
{
  for (int i = 0; i < 1024; i++)
    {
      a[i] = c[i*2];
      b[i] = c[i*2 + 1];
    }
}

#define vector8 __attribute__((vector_size(8)))
void f1(float *restrict a, float *restrict b, float *restrict c)
{
  for (int i = 0; i < 1024; i++)
    {
      vector8 float d = *(vector8 float *)&c[i*2];
      a[i] = d[0];
      b[i] = d[1];
    }
}
--- CUT ---
I would have expected f and f1 to produce the same code, but f does an ld2
followed by two quad stores (str q), while f1 does an ldr (d) and then a
str (s) and an st1 (s). As far as I can tell, on most processors
ld2/str(q)/str(q) is going to be slower than ldr/str/st1.
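For reference, the two instruction sequences described above can be written out with ACLE NEON intrinsics. This is only a rough sketch: the names f_ld2, f1_lanes and sequences_agree are made up for illustration and are not part of the testcase, and the compiler is free to select different instructions for these intrinsics than the ones named in the comments. On non-AArch64 targets the sketch falls back to the plain scalar loop so it still builds.

```c
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

enum { N = 1024 };

/* Shape of the autovectorized f: one de-interleaving load of four pairs
   (ld2), then two full q-register stores. */
void f_ld2(float *restrict a, float *restrict b, const float *restrict c)
{
#if defined(__aarch64__)
    for (int i = 0; i < N; i += 4) {
        float32x4x2_t d = vld2q_f32(&c[i * 2]); /* val[0]: evens, val[1]: odds */
        vst1q_f32(&a[i], d.val[0]);
        vst1q_f32(&b[i], d.val[1]);
    }
#else
    for (int i = 0; i < N; i++) { a[i] = c[i * 2]; b[i] = c[i * 2 + 1]; }
#endif
}

/* Shape of the manual f1: one 64-bit load of a single pair (ldr d),
   then one store per lane (str s for lane 0, st1 for lane 1). */
void f1_lanes(float *restrict a, float *restrict b, const float *restrict c)
{
#if defined(__aarch64__)
    for (int i = 0; i < N; i++) {
        float32x2_t d = vld1_f32(&c[i * 2]);
        vst1_lane_f32(&a[i], d, 0);
        vst1_lane_f32(&b[i], d, 1);
    }
#else
    for (int i = 0; i < N; i++) { a[i] = c[i * 2]; b[i] = c[i * 2 + 1]; }
#endif
}

/* Hypothetical self-check: returns 1 if both variants write identical
   a[] and b[] for the same input, 0 otherwise. */
int sequences_agree(void)
{
    static float c[2 * N], a0[N], b0[N], a1[N], b1[N];

    for (int i = 0; i < 2 * N; i++)
        c[i] = (float)i;

    f_ld2(a0, b0, c);
    f1_lanes(a1, b1, c);

    for (int i = 0; i < N; i++)
        if (a0[i] != a1[i] || b0[i] != b1[i])
            return 0;
    return 1;
}
```

Compiling this with -O2 -S on aarch64 and comparing the two loop bodies should make it easy to see whether the emitted instructions match the sequences named in the comments.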
I noticed this after the last talk about auto-vectorization.
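To check the "slower" claim on a particular core, something like the following rough timing harness could be used. It is a sketch under assumptions, not a proper benchmark: time_kernel and compare_kernels are hypothetical helpers, the noinline attributes and the empty asm barrier are only there to keep the compiler from folding the repeated calls together, and clock() is coarse, so the repetition count needs to be large enough to give stable numbers.

```c
#include <stdio.h>
#include <time.h>

#define vector8 __attribute__((vector_size(8)))
enum { N = 1024 };

/* From the report: left to the autovectorizer. */
__attribute__((noinline))
void f(float *restrict a, float *restrict b, float *restrict c)
{
    for (int i = 0; i < N; i++) {
        a[i] = c[i * 2];
        b[i] = c[i * 2 + 1];
    }
}

/* From the report: pair loaded manually as a 64-bit vector. */
__attribute__((noinline))
void f1(float *restrict a, float *restrict b, float *restrict c)
{
    for (int i = 0; i < N; i++) {
        vector8 float d = *(vector8 float *)&c[i * 2];
        a[i] = d[0];
        b[i] = d[1];
    }
}

typedef void (*kernel_fn)(float *, float *, float *);

/* Hypothetical helper: CPU seconds spent in `reps` calls of `fn`. */
double time_kernel(kernel_fn fn, float *a, float *b, float *c, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        fn(a, b, c);
        __asm__ volatile("" ::: "memory"); /* compiler barrier between calls */
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Hypothetical helper: times both kernels on the same input, prints the
   results, and returns the number of output elements that differ. */
int compare_kernels(int reps)
{
    /* 16-byte alignment keeps the 8-byte vector loads in f1 aligned. */
    static _Alignas(16) float c[2 * N], a0[N], b0[N], a1[N], b1[N];

    for (int i = 0; i < 2 * N; i++)
        c[i] = (float)i;

    double t_auto = time_kernel(f, a0, b0, c, reps);
    double t_manual = time_kernel(f1, a1, b1, c, reps);
    printf("f  (autovectorized): %.3f s\n", t_auto);
    printf("f1 (manual vector):  %.3f s\n", t_manual);

    int mismatches = 0;
    for (int i = 0; i < N; i++)
        mismatches += (a0[i] != a1[i]) + (b0[i] != b1[i]);
    return mismatches;
}
```

Calling compare_kernels(1000000) from main, built with -O3 so that f actually gets vectorized, should print the two timings; a nonzero return value would mean the two functions do not compute the same result.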