O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES

alalaw01 at gcc dot gnu.org Fri, 04 Dec 2015 10:21:14 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707


            Bug ID: 68707
           Summary: testcase gcc.dg/vect/O3-pr36098.c vectorized using
                    VEC_PERM_EXPR rather than VEC_LOAD_LANES
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64, arm

Created attachment 36928
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36928&action=edit
tree-vect-details dump (before patch, with LOAD_LANES)

Prior to r230993, O3-pr36098.c (at -O3) was vectorized using LOAD_LANES /
STORE_LANES, resulting in:

.L5:
        ld4     {v4.4s - v7.4s}, [x7], 64
        add     w4, w4, 1
        cmp     w3, w4
        orr     v1.16b, v4.16b, v4.16b
        orr     v2.16b, v5.16b, v5.16b
        orr     v3.16b, v6.16b, v6.16b
        st3     {v1.4s - v3.4s}, [x6], 48
        bhi     .L5

Each iteration of the outer loop processes a struct of 4 ints, of which the
first 3 are copied to a destination. The ld4 nicely gets us four structs, with
all the elements we want in three registers row-wise (and the elements we don't
want in a fourth):
struct1 struct2 struct3 struct4
v4.s[0] v4.s[1] v4.s[2] v4.s[3]
v5.s[0] v5.s[1] v5.s[2] v5.s[3]
v6.s[0] v6.s[1] v6.s[2] v6.s[3]
v7.s[0] v7.s[1] v7.s[2] v7.s[3]
and st3 stores the desired rows (only) to the right locations.
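
For reference, a minimal sketch of the kind of loop being described. This is
not copied from the testcase: the names struct quad, keep, skip and
copy_first_three are hypothetical and only illustrate the "struct of 4 ints,
first 3 copied" shape.

```c
#include <stddef.h>

/* Hypothetical layout: four ints per struct, of which only the first
   three are wanted in the destination. */
struct quad {
    int keep[3];
    int skip;
};

/* Outer loop: one source struct per iteration.
   Inner loop: copy its first three ints to a packed destination. */
void copy_first_three(size_t n, const struct quad *src, int *dst)
{
    for (size_t i = 0; i < n; i++, dst += 3)
        for (int m = 0; m < 3; m++)
            dst[m] = src[i].keep[m];
}
```

With a loop of this shape, the source is read with a stride of 4 ints and the
destination written with a stride of 3, which is what the ld4 / st3 pair above
expresses directly.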

Following r230993, the loop instead gets unrolled four times, four vectors are
loaded sequentially, and these are then permuted by SLP:

.L5:
        ldr     q0, [x5, 16]
        add     x4, x4, 48
        ldr     q1, [x5, 32]
        add     w6, w6, 1
        ldr     q4, [x5, 48]
        cmp     w3, w6
        ldr     q2, [x5], 64
        orr     v3.16b, v0.16b, v0.16b
        orr     v5.16b, v4.16b, v4.16b
        orr     v4.16b, v1.16b, v1.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v6.16b
        tbl     v2.16b, {v2.16b - v3.16b}, v7.16b
        tbl     v4.16b, {v4.16b - v5.16b}, v16.16b
        str     q0, [x4, -32]
        str     q2, [x4, -48]
        str     q4, [x4, -16]
        bhi     .L5

that is, we load

struct1 struct2 struct3 struct4
v2.s[0] v0.s[0] v1.s[0] v4.s[0]
v2.s[1] v0.s[1] v1.s[1] v4.s[1]
v2.s[2] v0.s[2] v1.s[2] v4.s[2]
v2.s[3] v0.s[3] v1.s[3] v4.s[3]

and then permute with tbl, leaving the results in v2, v0 and v4:

struct1 struct2 struct3 struct4
v2.s[0] v2.s[3] v0.s[2] v4.s[1]
v2.s[1] v0.s[0] v0.s[3] v4.s[2]
v2.s[2] v0.s[1] v4.s[0] v4.s[3]

so we then have the data 'columnwise' and store each register sequentially.
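
As a rough scalar model of what the three tbl permutes achieve in one unrolled
iteration (a sketch only: permute_model is a hypothetical name, and the
function works on element indices rather than on the byte selectors that tbl
actually takes):

```c
/* in[]  : 16 ints, i.e. four structs of four ints (the four q-register loads)
   out[] : 12 ints, i.e. three vectors of four ints (the three q-register stores)
   Every fourth input element (the unwanted struct member) is skipped. */
void permute_model(const int in[16], int out[12])
{
    for (int k = 0; k < 12; k++)
        out[k] = in[k + k / 3];
}
```

Under this model the first output vector takes input elements 0, 1, 2, 4, the
second takes 5, 6, 8, 9, and the third takes 10, 12, 13, 14, so each output
vector draws on two adjacent input vectors. That matches the two-register
table operand of each tbl in the listing above.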
