https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82147
Bug ID: 82147
Summary: Autovectorization for extraction is slower than done manually
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Target: aarch64

Take:

    void f(float *restrict a, float *restrict b, float *restrict c)
    {
      for (int i = 0; i < 1024; i++)
        {
          a[i] = c[i*2];
          b[i] = c[i*2 + 1];
        }
    }

    #define vector8 __attribute__((vector_size(8)))

    void f1(float *restrict a, float *restrict b, float *restrict c)
    {
      for (int i = 0; i < 1024; i++)
        {
          vector8 float d = *(vector8 float *)&c[i*2];
          a[i] = d[0];
          b[i] = d[1];
        }
    }

--- CUT ---

I would have expected f and f1 to produce the same code, but f does an ld2 followed by two quad stores, while f1 does an ldr(d) followed by a str(s) and an st1(s).

For most processors, ld2/str(q)/str(q) is going to be slower than ldr/str/st1, as far as I can tell.

I noticed this after the last talk about auto-vectorization.
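The comparison above only makes sense if f and f1 compute the same thing, so a small check along the following lines may be useful before looking at the generated assembly. This is a sketch, not part of the original report: check_deinterleave is a hypothetical helper name, the input is forced to 8-byte alignment so that the vector load in f1 is valid, and the test data is arbitrary.

```c
#define N 1024
#define vector8 __attribute__((vector_size(8)))

/* Scalar form from the report: the compiler is left to vectorize it. */
void f(float *restrict a, float *restrict b, float *restrict c)
{
    for (int i = 0; i < N; i++) {
        a[i] = c[i * 2];
        b[i] = c[i * 2 + 1];
    }
}

/* Manual form from the report: one 8-byte vector load per iteration,
   then the two lanes are extracted and stored separately. */
void f1(float *restrict a, float *restrict b, float *restrict c)
{
    for (int i = 0; i < N; i++) {
        vector8 float d = *(vector8 float *)&c[i * 2];
        a[i] = d[0];
        b[i] = d[1];
    }
}

/* Hypothetical helper (not in the report): returns 1 if f and f1
   produce identical outputs for the same interleaved input, else 0. */
int check_deinterleave(void)
{
    /* Assumption: 8-byte alignment, matching the vector8 load in f1. */
    static float c[2 * N] __attribute__((aligned(8)));
    static float a0[N], b0[N], a1[N], b1[N];

    /* Arbitrary test data; these values are exactly representable. */
    for (int i = 0; i < 2 * N; i++)
        c[i] = (float)i;

    f(a0, b0, c);
    f1(a1, b1, c);

    for (int i = 0; i < N; i++)
        if (a0[i] != a1[i] || b0[i] != b1[i])
            return 0;
    return 1;
}
```

Calling check_deinterleave() from a main function confirms functional equivalence only. To see the instruction sequences described above, compile the same file for aarch64 with something like gcc -O3 -S and compare the loop bodies of f and f1. The exact output will depend on the compiler version and tuning options.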