https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79262
Bug ID: 79262
Summary: [6/7 Regression] load gap with store gap causing performance regression in 462.libquantum
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Blocks: 53947
Target Milestone: ---
Target: aarch64

This was reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438#c9, but what is not mentioned there is that it is a regression from GCC 5. I noticed it again while working on the ThunderX 2 CN99xx performance difference between -O2 and -Ofast, and between GCC 5.4.0 and the trunk.

Take:
struct node_struct
{
  float _Complex gap;
  unsigned long long state;
};

struct reg_struct
{
  int size;
  struct node_struct *node;
};

void
func(int target, struct reg_struct *reg)
{
  int i;

  for(i=0; i<reg->size; i++)
    reg->node[i].state ^= ((unsigned long long) 1 << target);
}

---- CUT ---

Currently this is vectorized on the trunk using load gaps, but the store is then done with scalars. This is much slower, and it also only processes 2 elements at a time. There are also some cost model issues in the aarch64 backend dealing with scalar costs for int vs floating point. I might just go fix those first.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
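
To make the access pattern in the test case above concrete, here is a minimal sketch in C that was not part of the original report. It repeats the structs and func from the report and adds two hypothetical helpers, state_stride and check_func, which are assumptions made for illustration only. state_stride shows why the loop touches only every other 64-bit word (the gap member sits between consecutive state fields), and check_func is a scalar reference check that a vectorized build of func still flips only the target bit and leaves gap alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct node_struct { float _Complex gap; unsigned long long state; };
struct reg_struct { int size; struct node_struct *node; };

/* Loop from the report, unchanged in behavior. */
void func(int target, struct reg_struct *reg)
{
  int i;
  for (i = 0; i < reg->size; i++)
    reg->node[i].state ^= ((unsigned long long) 1 << target);
}

/* Hypothetical helper: distance between consecutive state fields,
   measured in units of sizeof(state). A value above 1 means the loop
   reads and writes memory with gaps between the accessed elements. */
size_t state_stride(void)
{
  return sizeof(struct node_struct) / sizeof(unsigned long long);
}

/* Hypothetical helper: returns 1 if func flips only bit `target` of
   every state field and leaves every gap field untouched, else 0. */
int check_func(int target, int n)
{
  struct reg_struct reg;
  int i, ok = 1;

  assert(n > 0 && target >= 0 && target < 64);
  reg.size = n;
  reg.node = malloc((size_t) n * sizeof *reg.node);
  if (reg.node == NULL)
    return 0;

  for (i = 0; i < n; i++) {
    reg.node[i].gap = (float) i;
    reg.node[i].state = (unsigned long long) i;
  }

  func(target, &reg);

  for (i = 0; i < n; i++) {
    unsigned long long want = (unsigned long long) i ^ (1ULL << target);
    if (reg.node[i].state != want || reg.node[i].gap != (float) i)
      ok = 0;
  }

  free(reg.node);
  return ok;
}
```

Calling check_func(3, 8) from a small driver and building with -O3 -fopt-info-vec is one way to see whether the loop was vectorized while confirming the result still matches the scalar semantics; the exact diagnostics printed depend on the GCC version.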