https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119187

            Bug ID: 119187
           Summary: vectorizer should be able to SLP already vectorized
                    code
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

Today there's a lot of code written using intrinsics for older
microarchitectures, and that code often isn't optimal for newer designs.

One of the promises of intrinsics is that the compiler should be able to do
better when it knows it can.  One way to take advantage of all the cost
modelling support in the vectorizer is to be able to vectorize already
vectorized code.

As an example:

#include <arm_neon.h>

void foo (uint8_t *a, uint8_t *b, uint8_t *c, int n)
{
  for (int i = 0; i < n; i+=16, a+=16, b+=16, c+=16)
    {
        uint8x8_t av1 = vld1_u8 (a);
        uint8x8_t av2 = vld1_u8 (a+8);
        uint8x8_t bv1 = vld1_u8 (b);
        uint8x8_t bv2 = vld1_u8 (b+8);
        vst1_u8 (c, vadd_u8 (av1, bv1));
        vst1_u8 (c+8, vadd_u8 (av2, bv2));
    }
}

which at -O3 generates:

.L3:
        ldr     d29, [x0, x4]
        ldr     d28, [x1, x4]
        ldr     d31, [x7, x4]
        ldr     d30, [x6, x4]
        add     v28.8b, v29.8b, v28.8b
        add     v30.8b, v31.8b, v30.8b
        str     d28, [x2, x4]
        str     d30, [x5, x4]
        add     x4, x4, 16
        cmp     w3, w4
        bgt     .L3

This underutilizes the load bandwidth.  Ideally these would be Q-sized loads
and a single Q-sized ADD, i.e. we'd SLP them.
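
For reference, here is a minimal sketch of what the loop would be equivalent
to at the source level with the Q-form arm_neon.h intrinsics (an illustration
of the desired shape only, not the output of the WIP patch below; foo_q is
just a name for the sketch):

#include <arm_neon.h>

void foo_q (uint8_t *a, uint8_t *b, uint8_t *c, int n)
{
  for (int i = 0; i < n; i += 16, a += 16, b += 16, c += 16)
    {
      uint8x16_t av = vld1q_u8 (a);      /* one Q-sized load of a[0..15] */
      uint8x16_t bv = vld1q_u8 (b);      /* one Q-sized load of b[0..15] */
      vst1q_u8 (c, vaddq_u8 (av, bv));   /* one Q-sized ADD and store    */
    }
}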

This ticket documents the approach and asks for feedback on how best to do
this.

I have a WIP patch against trunk that is able to re-vectorize the above into:

.L4:
        ldr     q29, [x0, x4]
        ldr     q31, [x1, x4]
        ldr     q0, [x10, x4]
        ldr     q30, [x9, x4]
        add     v31.16b, v29.16b, v31.16b
        add     v30.16b, v0.16b, v30.16b
        str     q31, [x2, x4]
        str     q30, [x8, x4]
        add     x4, x4, 16
        cmp     x7, x4
        bne     .L4
        tst     x5, 15
        beq     .L1
        and     w4, w5, -16
        lsl     w8, w4, 4
        add     x9, x2, w4, uxtw 4
        add     x5, x1, w4, uxtw 4
        add     x7, x0, w4, uxtw 4
.L3:
        sub     w6, w6, w4
        cmp     w6, 6
        bls     .L6
        ubfiz   x4, x4, 4, 32
        add     w6, w6, 1
        add     x10, x4, 8
        ldr     d26, [x0, x4]
        ldr     d28, [x1, x4]
        ldr     d1, [x0, x10]
        ldr     d27, [x1, x10]
        add     v28.8b, v26.8b, v28.8b
        add     v27.8b, v1.8b, v27.8b
        str     d28, [x2, x4]
        str     d27, [x2, x10]
        tst     x6, 7
        beq     .L1

which happens mostly because it gets the unroll factor wrong, and the loop
increment is also not correct.  However, the SLP tree itself and the vectypes
look correct:

 note:   === vect_analyze_data_refs ===
 note:   got vectype for stmt: _13 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})a_30];
vector(16) unsigned char
 note:   got vectype for stmt: _14 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})a_30 + 8B];
vector(16) unsigned char
 note:   got vectype for stmt: _15 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})b_31];
vector(16) unsigned char
 note:   got vectype for stmt: _16 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})b_31 + 8B];
vector(16) unsigned char
 note:   got vectype for stmt: MEM <__Uint8x8_t> [(unsigned char *
{ref-all})c_32] = _17;
vector(16) unsigned char
 note:   got vectype for stmt: MEM <__Uint8x8_t> [(unsigned char *
{ref-all})c_32 + 8B] = _18;
vector(16) unsigned char

...

 note:   === vect_analyze_data_ref_accesses ===
 note:   Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char *
{ref-all})a_30]
 note:   Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char *
{ref-all})a_30 + 8B]
 note:   Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char *
{ref-all})b_31]
 note:   Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char *
{ref-all})b_31 + 8B]
 note:   Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char *
{ref-all})c_32]
 note:   Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char *
{ref-all})c_32 + 8B]

...

 note:   ==> examining statement: _13 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})a_30];
 note:   precomputed vectype: vector(16) unsigned char
 note:   get vectype for smallest scalar type: __Uint8x8_t
 note:   nunits vectype: vector(16) unsigned char
 note:   nunits = 16
 note:   ==> examining statement: _14 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})a_30 + 8B];
 note:   precomputed vectype: vector(16) unsigned char
 note:   get vectype for smallest scalar type: __Uint8x8_t
 note:   nunits vectype: vector(16) unsigned char
 note:   nunits = 16

...

The costing is off, though:

 note:   === vect_compute_single_scalar_iteration_cost ===
MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30] 1 times scalar_load costs 1
in prologue
MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30 + 8B] 1 times scalar_load
costs 1 in prologue
MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31] 1 times scalar_load costs 1
in prologue
MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31 + 8B] 1 times scalar_load
costs 1 in prologue
_13 + _15 1 times scalar_stmt costs 1 in prologue
_17 1 times scalar_store costs 1 in prologue
_14 + _16 1 times scalar_stmt costs 1 in prologue
_18 1 times scalar_store costs 1 in prologue

The VF also looks wrong to me.  I think VF=2 here, since we consider the
scalar mode to be V8QI, no?  Or should we consider the scalar mode to be QI,
in which case VF=16 is correct?
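
As a back-of-the-envelope check of those two interpretations (my own
arithmetic, not vectorizer output):

#include <stdio.h>

int main (void)
{
  const int vec_bytes = 16;  /* bytes in V16QI, the vectype chosen above */
  /* Treating V8QI (8 bytes) as the scalar unit gives VF = 16 / 8 = 2.  */
  printf ("VF with V8QI units: %d\n", vec_bytes / 8);
  /* Treating QI (1 byte) as the scalar unit gives VF = 16 / 1 = 16.  */
  printf ("VF with QI units:   %d\n", vec_bytes / 1);
  return 0;
}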

Here I think the detected unroll factor is wrong; I'd expect unroll factor ==
1:

 note:   SLP graph after lowering permutations:
 note:   node 0x5896350 (max_nunits=16, refcnt=2) vector(16) unsigned char
 note:   op template: MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32] =
_17;
 note:        stmt 0 MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32] = _17;
 note:        children 0x58963e8
 note:   node 0x58963e8 (max_nunits=16, refcnt=2) vector(16) unsigned char
 note:   op template: _17 = _13 + _15;
 note:        stmt 0 _17 = _13 + _15;
 note:        children 0x5896480 0x5896518
 note:   node 0x5896480 (max_nunits=16, refcnt=2) vector(16) unsigned char
 note:   op template: _13 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})a_30];
 note:        stmt 0 _13 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30];
 note:        load permutation { 0 }
 note:   node 0x5896518 (max_nunits=16, refcnt=2) vector(16) unsigned char
 note:   op template: _15 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})b_31];
 note:        stmt 0 _15 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31];
 note:        load permutation { 0 }
 note:   node 0x58965b0 (max_nunits=16, refcnt=2) vector(16) unsigned char
 note:   op template: MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32 + 8B]
= _18;
 note:        stmt 0 MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32 + 8B] =
_18;
 note:        children 0x5896648
 note:   node 0x5896648 (max_nunits=16, refcnt=2) vector(16) unsigned char
 note:   op template: _18 = _14 + _16;
 note:        stmt 0 _18 = _14 + _16;
 note:        children 0x58966e0 0x5896778
 note:   node 0x58966e0 (max_nunits=16, refcnt=2) vector(16) unsigned char
 note:   op template: _14 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30
+ 8B];
 note:        stmt 0 _14 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30 +
8B];
 note:        load permutation { 0 }
 note:   node 0x5896778 (max_nunits=16, refcnt=2) vector(16) unsigned char
 note:   op template: _16 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31
+ 8B];
 note:        stmt 0 _16 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31 +
8B];
 note:        load permutation { 0 }
 note:   === vect_make_slp_decision ===
 note:   Decided to SLP 2 instances. Unrolling factor 16

which I think is what's causing the incorrect codegen.

So far I've had to modify:
 * vect_analyze_group_access_1: Don't treat vector loads as strided accesses
unless there's a gap between group members.
 * vect_analyze_data_ref_accesses: Don't consider vector loads as interleaving
by default
 * vectorizable_operation, vectorizable_load: Check the scalar type precision
rather than the "scalar vector" type (see the rough sketch after this list).
 * get_related_vectype_for_scalar_type: Support vector types as scalar types.
 * vect_get_vector_types_for_stmt: Allow vector inputs.
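
To make the vectorizable_operation/vectorizable_load bullet concrete, here is
a rough, hypothetical sketch of the kind of check meant there (these are the
standard GCC tree macros, but this is not the actual WIP change):

  /* If the "scalar" type is itself a vector (e.g. __Uint8x8_t), look at the
     precision of its element type instead of the whole vector type.  */
  tree scalar_type = TREE_TYPE (op);
  if (VECTOR_TYPE_P (scalar_type))
    scalar_type = TREE_TYPE (scalar_type);
  unsigned int prec = TYPE_PRECISION (scalar_type);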
