https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707
            Bug ID: 68707
           Summary: testcase gcc.dg/vect/O3-pr36098.c vectorized using
                    VEC_PERM_EXPR rather than VEC_LOAD_LANES
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64, arm

Created attachment 36928
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36928&action=edit
tree-vect-details dump (before patch, with LOAD_LANES)

Prior to r230993, O3-pr36098.c (at -O3) was vectorized using LOAD_LANES /
STORE_LANES, resulting in:

.L5:
        ld4     {v4.4s - v7.4s}, [x7], 64
        add     w4, w4, 1
        cmp     w3, w4
        orr     v1.16b, v4.16b, v4.16b
        orr     v2.16b, v5.16b, v5.16b
        orr     v3.16b, v6.16b, v6.16b
        st3     {v1.4s - v3.4s}, [x6], 48
        bhi     .L5

Each iteration of the outer loop processes a struct of 4 ints, of which the
first 3 are copied to a destination. The ld4 nicely gets us four structs with
all the elements we want in three registers row-wise (and the elements we
don't want in a fourth):

        struct1  struct2  struct3  struct4
        v4.s[0]  v4.s[1]  v4.s[2]  v4.s[3]
        v5.s[0]  v5.s[1]  v5.s[2]  v5.s[3]
        v6.s[0]  v6.s[1]  v6.s[2]  v6.s[3]
        v7.s[0]  v7.s[1]  v7.s[2]  v7.s[3]

and st3 stores the desired rows (only) to the right locations.

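For readers less familiar with the lane instructions, the ld4/st3 behaviour described above can be modelled in plain C. This is only a sketch, not the testcase source: `struct quad`, `copy_scalar` and `copy_lanes` are hypothetical names chosen for illustration, and the model assumes the element count is a multiple of 4 (no scalar epilogue).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: four ints per struct, only the first three are copied. */
struct quad { int v[4]; };

/* Scalar reference: copy the first three ints of each struct to dst. */
void copy_scalar(size_t n, const struct quad *src, int *dst)
{
    for (size_t i = 0; i < n; i++)
        for (size_t m = 0; m < 3; m++)
            dst[3 * i + m] = src[i].v[m];
}

/* Model of the ld4/st3 loop body: four structs per iteration. */
void copy_lanes(size_t n, const struct quad *src, int *dst)
{
    assert(n % 4 == 0);  /* sketch only: no scalar epilogue */

    for (size_t i = 0; i < n; i += 4) {
        int row[4][4];   /* row[m][s] models register v(4+m), lane s */

        /* ld4: de-interleave, so element m of struct s lands in row m, lane s. */
        for (size_t s = 0; s < 4; s++)
            for (size_t m = 0; m < 4; m++)
                row[m][s] = src[i + s].v[m];

        /* st3: re-interleave rows 0..2 only; row 3 is never stored. */
        for (size_t s = 0; s < 4; s++)
            for (size_t m = 0; m < 3; m++)
                dst[3 * (i + s) + m] = row[m][s];
    }
}
```

Calling both functions on the same input should give identical destination arrays, which is the property the vectorized loop has to preserve.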
Following r230993, the loop is instead unrolled four times; four vectors are
loaded sequentially and then permuted by SLP:

.L5:
        ldr     q0, [x5, 16]
        add     x4, x4, 48
        ldr     q1, [x5, 32]
        add     w6, w6, 1
        ldr     q4, [x5, 48]
        cmp     w3, w6
        ldr     q2, [x5], 64
        orr     v3.16b, v0.16b, v0.16b
        orr     v5.16b, v4.16b, v4.16b
        orr     v4.16b, v1.16b, v1.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v6.16b
        tbl     v2.16b, {v2.16b - v3.16b}, v7.16b
        tbl     v4.16b, {v4.16b - v5.16b}, v16.16b
        str     q0, [x4, -32]
        str     q2, [x4, -48]
        str     q4, [x4, -16]
        bhi     .L5

that is, we load

        struct1  struct2  struct3  struct4
        v2.s[0]  v0.s[0]  v1.s[0]  v4.s[0]
        v2.s[1]  v0.s[1]  v1.s[1]  v4.s[1]
        v2.s[2]  v0.s[2]  v1.s[2]  v4.s[2]
        v2.s[3]  v0.s[3]  v1.s[3]  v4.s[3]

and then permute

        struct1  struct2  struct3  struct4
        v2.s[0]  v2.s[3]  v0.s[2]  v4.s[1]
        v2.s[1]  v0.s[0]  v0.s[3]  v4.s[2]
        v2.s[2]  v0.s[1]  v4.s[0]  v4.s[3]

so we then have the data 'columnwise' and store each sequentially.
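The load/permute/store scheme can be sketched in C in the same way. This is an illustrative model only: `struct quad`, `perm2` and `copy_permute` are hypothetical names, the permute works at 32-bit element granularity rather than the byte granularity of tbl, and the selector values are derived from the required output order, since the contents of the index registers (v6, v7, v16) are not shown in this report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: four ints per struct, only the first three are copied. */
struct quad { int v[4]; };

/* Hypothetical helper modelling a two-input permute on 32-bit elements:
   selector values 0..3 pick from a, 4..7 pick from b. */
static void perm2(const int a[4], const int b[4], const int idx[4], int out[4])
{
    for (size_t k = 0; k < 4; k++) {
        assert(idx[k] >= 0 && idx[k] < 8);
        out[k] = idx[k] < 4 ? a[idx[k]] : b[idx[k] - 4];
    }
}

/* Model of the unrolled loop: four one-struct vector loads, three two-input
   permutes, three contiguous vector stores (12 ints) per iteration. */
void copy_permute(size_t n, const struct quad *src, int *dst)
{
    /* Assumed selectors: output vector j draws on structs j and j+1. */
    static const int sel[3][4] = {
        { 0, 1, 2, 4 },   /* s0[0] s0[1] s0[2] s1[0] */
        { 1, 2, 4, 5 },   /* s1[1] s1[2] s2[0] s2[1] */
        { 2, 4, 5, 6 },   /* s2[2] s3[0] s3[1] s3[2] */
    };

    assert(n % 4 == 0);  /* sketch only: no scalar epilogue */

    for (size_t i = 0; i < n; i += 4)
        for (size_t j = 0; j < 3; j++)
            perm2(src[i + j].v, src[i + j + 1].v, sel[j], dst + 3 * i + 4 * j);
}
```

Each output vector needs elements from two adjacent input vectors, which is consistent with the three two-register tbl instructions in the loop above; the lane-based version gets the same result from one ld4 and one st3 without any explicit permutes.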