https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772
Bug ID: 98772
Summary: Widening patterns causing missed vectorization
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: joelh at gcc dot gnu.org
Target Milestone: ---

Disabling the widening patterns (widening_mult, widening_plus,
widening_minus) allows some testcases to be vectorized better. Currently,
mixed scalar and vector code is produced, because the patterns are
recognized and substituted but vectorization then fails with 'no optab'.

When the patterns are recognized as 16 bytes -> 16 shorts, the use of a pair
of 8 byte -> 8 short instructions is presumed; the datatypes chosen in
'vectorizable_conversion' are 'vectype_in' 8 bytes, 'vectype_out' 8 shorts.
This causes scalar code to be emitted where these patterns were recognized.

For the following testcase, compiled with gcc -O3:

    #include <stdint.h>

    extern void wdiff( int16_t d[16], uint8_t *restrict pix1, uint8_t *restrict pix2)
    {
        for( int y = 0; y < 4; y++ )
        {
            for( int x = 0; x < 4; x++ )
                d[x + y*4] = pix1[x] * pix2[x];
            pix1 += 16;
            pix2 += 16;
        }
    }

The following output is seen, processing 8 elements per cycle using scalar
instructions and 8 elements per cycle using vector instructions.

    wdiff:
    .LFB0:
            .cfi_startproc
            ldrb    w3, [x1, 32]
            ldrb    w6, [x2, 32]
            ldrb    w8, [x1, 33]
            ldrb    w5, [x2, 33]
            ldrb    w4, [x1, 34]
            mul     w3, w3, w6
            ldrb    w7, [x1, 35]
            fmov    s0, w3
            ldrb    w3, [x2, 34]
            mul     w8, w8, w5
            ldrb    w9, [x2, 35]
            ldrb    w6, [x2, 48]
            ldrb    w5, [x1, 49]
            ins     v0.h[1], w8
            mul     w3, w4, w3
            mul     w7, w7, w9
            ldrb    w4, [x1, 48]
            ldrb    w8, [x2, 49]
            ldrb    w9, [x2, 50]
            ins     v0.h[2], w3
            ldrb    w3, [x1, 51]
            mul     w6, w6, w4
            ldrb    w4, [x1, 50]
            mul     w5, w5, w8
            ldrb    w8, [x2, 51]
            ldr     d2, [x1]
            ins     v0.h[3], w7
            ldr     d1, [x2]
            mul     w4, w4, w9
            ldr     d4, [x1, 16]
            ldr     d3, [x2, 16]
            mul     w1, w3, w8
            ins     v0.h[4], w6
            zip1    v2.2s, v2.2s, v4.2s
            zip1    v1.2s, v1.2s, v3.2s
            ins     v0.h[5], w5
            umull   v1.8h, v1.8b, v2.8b
            ins     v0.h[6], w4
            ins     v0.h[7], w1
            stp     q1, q0, [x0]
            ret

If the widening multiply pattern is disabled, e.g.:

    -  { vect_recog_widen_mult_pattern, "widen_mult" },
    +  //{ vect_recog_widen_mult_pattern, "widen_mult" },

in tree-vect-patterns.c, then the same testcase is able to process 16
elements per cycle using vector instructions.

    wdiff:
    .LFB0:
            .cfi_startproc
            ldr     b3, [x1, 33]
            ldr     b2, [x2, 33]
            ldr     b1, [x1, 32]
            ldr     b0, [x2, 32]
            ldr     b5, [x1, 34]
            ins     v1.b[1], v3.b[0]
            ldr     b4, [x2, 34]
            ins     v0.b[1], v2.b[0]
            ldr     b3, [x1, 35]
            ldr     b2, [x2, 35]
            ldr     b19, [x1, 48]
            ins     v1.b[2], v5.b[0]
            ldr     b17, [x2, 48]
            ins     v0.b[2], v4.b[0]
            ldr     b18, [x1, 49]
            ldr     b16, [x2, 49]
            ldr     b7, [x1, 50]
            ins     v1.b[3], v3.b[0]
            ldr     b6, [x2, 50]
            ins     v0.b[3], v2.b[0]
            ldr     b5, [x1, 51]
            ldr     b4, [x2, 51]
            ldr     d3, [x1]
            ins     v1.b[4], v19.b[0]
            ldr     d2, [x2]
            ins     v0.b[4], v17.b[0]
            ldr     d19, [x1, 16]
            ldr     d17, [x2, 16]
            ins     v1.b[5], v18.b[0]
            zip1    v3.2s, v3.2s, v19.2s
            ins     v0.b[5], v16.b[0]
            zip1    v2.2s, v2.2s, v17.2s
            ins     v1.b[6], v7.b[0]
            umull   v2.8h, v2.8b, v3.8b
            ins     v0.b[6], v6.b[0]
            ins     v1.b[7], v5.b[0]
            ins     v0.b[7], v4.b[0]
            umull   v0.8h, v0.8b, v1.8b
            stp     q2, q0, [x0]
            ret
            .cfi_endproc

Note the use of two umull instructions. The same can be seen for widening
plus and widening minus.
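
The report says the same behaviour occurs for widening plus and widening minus but gives no testcases for them. The sketch below assumes they follow the wdiff testcase with only the operator changed. The names wsum and wsub are hypothetical, and the generated code for these variants has not been checked here.

```c
#include <stdint.h>

/* Assumed widening-plus analogue of wdiff: uint8_t + uint8_t -> int16_t.
   The name wsum is hypothetical and does not come from the report.  */
void wsum (int16_t d[16], uint8_t *restrict pix1, uint8_t *restrict pix2)
{
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y * 4] = pix1[x] + pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}

/* Assumed widening-minus analogue: uint8_t - uint8_t -> int16_t.
   The name wsub is hypothetical and does not come from the report.  */
void wsub (int16_t d[16], uint8_t *restrict pix1, uint8_t *restrict pix2)
{
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y * 4] = pix1[x] - pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}
```

Compiling these with gcc -O3 for AArch64, with and without the corresponding pattern disabled, should show whether the same mix of scalar and vector code appears.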

It appears to be due to the way that vectype_in is chosen in
vectorizable_conversion. In vectorizable_conversion (tree-vect-stmts.c:4626),
vect_is_simple_use fills the &vectype1_in parameter, which in turn sets
vectype_in. During SLP vectorization, vect_is_simple_use uses the SLP tree
vectype:

tree-vect-stmts.c:11369:

      if (slp_node)
        {
          slp_tree child = SLP_TREE_CHILDREN (slp_node)[operand];
          *slp_def = child;
          *vectype = SLP_TREE_VECTYPE (child);
          if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
            {
              *op = gimple_get_lhs (SLP_TREE_REPRESENTATIVE (child)->stmt);
              return vect_is_simple_use (*op, vinfo, dt, def_stmt_info_out);
            }

For 'vect' vectorization, the def_stmt_info is used.
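
To make the "16 bytes -> 16 shorts via a pair of 8 byte -> 8 short operations" decomposition from the description concrete, here is a minimal plain-C model. It is not GCC code and does not reproduce the vectorizer's type selection. The helper names are hypothetical, and each half only models what a umull-style 8 byte -> 8 short multiply computes.

```c
#include <stdint.h>

/* Hypothetical model of one 8 byte -> 8 short widening multiply
   applied to the low half of a 16-byte input.  */
static void widen_mult_lo (uint16_t out[8], const uint8_t a[16],
                           const uint8_t b[16])
{
  for (int i = 0; i < 8; i++)
    out[i] = (uint16_t) (a[i] * b[i]);
}

/* Hypothetical model of the same operation applied to the high half.  */
static void widen_mult_hi (uint16_t out[8], const uint8_t a[16],
                           const uint8_t b[16])
{
  for (int i = 0; i < 8; i++)
    out[i] = (uint16_t) (a[i + 8] * b[i + 8]);
}

/* 16 bytes -> 16 shorts expressed as the pair of half-width operations.  */
void widen_mult_16 (uint16_t out[16], const uint8_t a[16],
                    const uint8_t b[16])
{
  widen_mult_lo (out, a, b);
  widen_mult_hi (out + 8, a, b);
}
```

Under this model the full-width operation needs both halves. If only 8-lane input and output types are considered, as the report describes for vectorizable_conversion, a single half-width operation covers only 8 of the 16 elements.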