https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119187
            Bug ID: 119187
           Summary: vectorizer should be able to SLP already vectorized code
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

Today there is a lot of code written with intrinsics for older microarchitectures, and that code is not optimal for newer designs.  One of the promises of intrinsics is that the compiler should be able to do better when it knows it can.  One way to take advantage of all the cost-modelling support in the vectorizer is to be able to vectorize already vectorized code.  As an example:

#include <arm_neon.h>

void foo (uint8_t *a, uint8_t *b, uint8_t *c, int n)
{
  for (int i = 0; i < n; i += 16, a += 16, b += 16, c += 16)
    {
      uint8x8_t av1 = vld1_u8 (a);
      uint8x8_t av2 = vld1_u8 (a + 8);
      uint8x8_t bv1 = vld1_u8 (b);
      uint8x8_t bv2 = vld1_u8 (b + 8);
      vst1_u8 (c, vadd_u8 (av1, bv1));
      vst1_u8 (c + 8, vadd_u8 (av2, bv2));
    }
}

at -O3 this generates:

.L3:
        ldr     d29, [x0, x4]
        ldr     d28, [x1, x4]
        ldr     d31, [x7, x4]
        ldr     d30, [x6, x4]
        add     v28.8b, v29.8b, v28.8b
        add     v30.8b, v31.8b, v30.8b
        str     d28, [x2, x4]
        str     d30, [x5, x4]
        add     x4, x4, 16
        cmp     w3, w4
        bgt     .L3

which underutilizes the load bandwidth.  Ideally these would be Q-sized loads and one ADD, i.e. we'd SLP them.  This ticket documents the approach and asks for feedback on how best to do this.

I have a WIP patch on trunk that is able to re-vectorize the above into:

.L4:
        ldr     q29, [x0, x4]
        ldr     q31, [x1, x4]
        ldr     q0, [x10, x4]
        ldr     q30, [x9, x4]
        add     v31.16b, v29.16b, v31.16b
        add     v30.16b, v0.16b, v30.16b
        str     q31, [x2, x4]
        str     q30, [x8, x4]
        add     x4, x4, 16
        cmp     x7, x4
        bne     .L4
        tst     x5, 15
        beq     .L1
        and     w4, w5, -16
        lsl     w8, w4, 4
        add     x9, x2, w4, uxtw 4
        add     x5, x1, w4, uxtw 4
        add     x7, x0, w4, uxtw 4
.L3:
        sub     w6, w6, w4
        cmp     w6, 6
        bls     .L6
        ubfiz   x4, x4, 4, 32
        add     w6, w6, 1
        add     x10, x4, 8
        ldr     d26, [x0, x4]
        ldr     d28, [x1, x4]
        ldr     d1, [x0, x10]
        ldr     d27, [x1, x10]
        add     v28.8b, v26.8b, v28.8b
        add     v27.8b, v1.8b, v27.8b
        str     d28, [x2, x4]
        str     d27, [x2, x10]
        tst     x6, 7
        beq     .L1

The extra epilogue code happens mostly because it gets the unroll factor wrong, and the loop increment is also not correct.  However, the SLP tree itself and the vectypes look correct:

note:   === vect_analyze_data_refs ===
note:   got vectype for stmt: _13 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30]; vector(16) unsigned char
note:   got vectype for stmt: _14 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30 + 8B]; vector(16) unsigned char
note:   got vectype for stmt: _15 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31]; vector(16) unsigned char
note:   got vectype for stmt: _16 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31 + 8B]; vector(16) unsigned char
note:   got vectype for stmt: MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32] = _17; vector(16) unsigned char
note:   got vectype for stmt: MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32 + 8B] = _18; vector(16) unsigned char
...
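For reference, before digging further into the dumps: the shape the re-vectorized loop should ideally take corresponds to something like the following hand-written Q-register version (just an illustrative sketch; foo_q is a made-up name, and the tail for n not a multiple of 16 is ignored exactly as in the original example):

#include <arm_neon.h>

void foo_q (uint8_t *a, uint8_t *b, uint8_t *c, int n)
{
  /* Same work as foo, but with 128-bit (Q) loads/stores and a single
     16-lane add per pair of inputs instead of two 64-bit (D) operations.  */
  for (int i = 0; i < n; i += 16, a += 16, b += 16, c += 16)
    {
      uint8x16_t av = vld1q_u8 (a);
      uint8x16_t bv = vld1q_u8 (b);
      vst1q_u8 (c, vaddq_u8 (av, bv));
    }
}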
note:   === vect_analyze_data_ref_accesses ===
note:   Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30]
note:   Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30 + 8B]
note:   Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31]
note:   Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31 + 8B]
note:   Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32]
note:   Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32 + 8B]
...
note:   ==> examining statement: _13 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30];
note:   precomputed vectype: vector(16) unsigned char
note:   get vectype for smallest scalar type: __Uint8x8_t
note:   nunits vectype: vector(16) unsigned char
note:   nunits = 16
note:   ==> examining statement: _14 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30 + 8B];
note:   precomputed vectype: vector(16) unsigned char
note:   get vectype for smallest scalar type: __Uint8x8_t
note:   nunits vectype: vector(16) unsigned char
note:   nunits = 16
...

The costing is off though:

note:   === vect_compute_single_scalar_iteration_cost ===
MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30] 1 times scalar_load costs 1 in prologue
MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30 + 8B] 1 times scalar_load costs 1 in prologue
MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31] 1 times scalar_load costs 1 in prologue
MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31 + 8B] 1 times scalar_load costs 1 in prologue
_13 + _15 1 times scalar_stmt costs 1 in prologue
_17 1 times scalar_store costs 1 in prologue
_14 + _16 1 times scalar_stmt costs 1 in prologue
_18 1 times scalar_store costs 1 in prologue

The VF also looks wrong to me: I think VF=2 here, since we consider the scalar mode to be V8QI, no?  Or should we consider the scalar mode to be QI, in which case VF=16 is correct?
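To spell out my reading of the two conventions (back-of-the-envelope arithmetic only, not something the dumps state):

  scalar unit = V8QI: one V16QI statement covers 16/8 = 2 of the original
                      8-byte "scalar" statements, so VF = 2 and the IV step
                      of 16 bytes per iteration stays as written.
  scalar unit = QI:   one V16QI statement covers 16 elements, so VF = 16,
                      which only adds up if the unroll factor and the IV
                      step are then re-derived in element units.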
Here I think the detected unroll factor is wrong; I'd expect unroll factor == 1:

note:   SLP graph after lowering permutations:
note:   node 0x5896350 (max_nunits=16, refcnt=2) vector(16) unsigned char
note:   op template: MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32] = _17;
note:        stmt 0 MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32] = _17;
note:        children 0x58963e8
note:   node 0x58963e8 (max_nunits=16, refcnt=2) vector(16) unsigned char
note:   op template: _17 = _13 + _15;
note:        stmt 0 _17 = _13 + _15;
note:        children 0x5896480 0x5896518
note:   node 0x5896480 (max_nunits=16, refcnt=2) vector(16) unsigned char
note:   op template: _13 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30];
note:        stmt 0 _13 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30];
note:        load permutation { 0 }
note:   node 0x5896518 (max_nunits=16, refcnt=2) vector(16) unsigned char
note:   op template: _15 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31];
note:        stmt 0 _15 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31];
note:        load permutation { 0 }
note:   node 0x58965b0 (max_nunits=16, refcnt=2) vector(16) unsigned char
note:   op template: MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32 + 8B] = _18;
note:        stmt 0 MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32 + 8B] = _18;
note:        children 0x5896648
note:   node 0x5896648 (max_nunits=16, refcnt=2) vector(16) unsigned char
note:   op template: _18 = _14 + _16;
note:        stmt 0 _18 = _14 + _16;
note:        children 0x58966e0 0x5896778
note:   node 0x58966e0 (max_nunits=16, refcnt=2) vector(16) unsigned char
note:   op template: _14 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30 + 8B];
note:        stmt 0 _14 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30 + 8B];
note:        load permutation { 0 }
note:   node 0x5896778 (max_nunits=16, refcnt=2) vector(16) unsigned char
note:   op template: _16 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31 + 8B];
note:        stmt 0 _16 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31 + 8B];
note:        load permutation { 0 }
note:   === vect_make_slp_decision ===
note:   Decided to SLP 2 instances. Unrolling factor 16

which I think is what is causing the incorrect codegen.

So far I've had to modify the following (an illustrative sketch of the kind of change follows the list):

* vect_analyze_group_access_1: Don't treat vector loads as strided accesses unless there's a gap between group members.
* vect_analyze_data_ref_accesses: Don't consider vector loads as interleaving by default.
* vectorizable_operation, vectorizable_load: Check the scalar type precision rather than the "scalar vector" type.
* get_related_vectype_for_scalar_type: Support vector types as scalar types.
* get_related_vectype_for_scalar_type: Ditto.
* vect_get_vector_types_for_stmt: Allow vector inputs.
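To make the precision-check items a bit more concrete, the sort of adjustment involved looks roughly like the following (a hypothetical sketch written against the usual tree accessors, not an excerpt from the WIP patch):

  /* Hypothetical sketch, not the WIP patch: when the "scalar" type of a
     statement is itself a vector (e.g. __Uint8x8_t coming from intrinsics),
     fall back to its element type so the existing TYPE_PRECISION-based
     checks in vectorizable_operation and friends keep working.  */
  tree scalar_type = TREE_TYPE (gimple_get_lhs (stmt));
  if (VECTOR_TYPE_P (scalar_type))
    scalar_type = TREE_TYPE (scalar_type);   /* Element type, e.g. unsigned char.  */
  unsigned int prec = TYPE_PRECISION (scalar_type);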