[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement

fxue at os dot amperecomputing.com via Gcc-bugs Wed, 20 Dec 2023 21:25:38 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091


--- Comment #2 from Feng Xue <fxue at os dot amperecomputing.com> ---
(In reply to Richard Biener from comment #1)
> It's the logic
> 
>   FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info)
>     {
>       if (svisited.contains (stmt_info))
>         continue;
>       stmt_vec_info orig_stmt_info = vect_orig_stmt (stmt_info);
>       if (STMT_VINFO_IN_PATTERN_P (orig_stmt_info)
>           && STMT_VINFO_RELATED_STMT (orig_stmt_info) != stmt_info)
>         /* Only the pattern root stmt computes the original scalar value.  */
>         continue;
>       bool mark_visited = true;
>       gimple *orig_stmt = orig_stmt_info->stmt;
>       ssa_op_iter op_iter;
>       def_operand_p def_p;
>       FOR_EACH_PHI_OR_STMT_DEF (def_p, orig_stmt, op_iter, SSA_OP_DEF)
>         {
>           imm_use_iterator use_iter;
>           gimple *use_stmt;
>           stmt_vec_info use_stmt_info;
>           FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, DEF_FROM_PTR (def_p))
>             if (!is_gimple_debug (use_stmt))
>               {
>                 use_stmt_info = bb_vinfo->lookup_stmt (use_stmt);
>                 if (!use_stmt_info
>                     || !PURE_SLP_STMT (vect_stmt_to_vectorize
> (use_stmt_info)))
>                   {
>                     STMT_VINFO_LIVE_P (stmt_info) = true;
> 
> specifically the last check.  That's supposed to pick up the "main" pattern
> that's now covering the scalar stmt.
> 
> But somehow the "main" pattern,
> 
> patt_67 = .VEC_WIDEN_MINUS (_1, _3);  //  _5 = _2 - _4;
>  patt_68 = (signed short) patt_67;     //  _5 = _2 - _4;
>  patt_69 = (int) patt_68;              //  _5 = _2 - _4;
> 
> doesn't get picked up here.  I wonder what's the orig_stmt and the def
> picked and what original scalar use we end up in where the
> vect_stmt_to_vectorize isn't the "last" pattern.  Maybe we really want


This problem happens at slp node:

 note: node 0x425bc38 (max_nunits=8, refcnt=1) vector(8) char
 note: op template: _1 = *a_50(D);
 note:  stmt 0 _1 = *a_50(D);
 note:  stmt 1 _7 = MEM[(char *)a_50(D) + 1B];
 note:  stmt 2 _13 = MEM[(char *)a_50(D) + 2B];
 note:  stmt 3 _19 = MEM[(char *)a_50(D) + 3B];
 note:  stmt 4 _25 = MEM[(char *)a_50(D) + 4B];
 note:  stmt 5 _31 = MEM[(char *)a_50(D) + 5B];
 note:  stmt 6 _37 = MEM[(char *)a_50(D) + 6B];
 note:  stmt 7 _43 = MEM[(char *)a_50(D) + 7B];

The orig_stmt is "_1 = *a_50(D)"

The use stmt is "_2 = (int) _1", whose pattern statement is "patt_64 = (int)
patt_63", which is not referenced by any original or other pattern statements.
Or in other word, the orig_stmt could be absorbed into a vector operation,
without any outlier scalar use.

The fore-mentioned "last check" in vect_bb_slp_mark_live_stmts would make the
orig_stmt to be STMT_VINFO_LIVE_P, which actually implies it has scalar use
(though it should not have), the difference is re-generating the def somewhere,
rather than retaining the original scalar statement. And the following
"vectorizable_live_operation" would account the new operations into
vectorization cost of the SLP instance.

The function vect_bb_vectorization_profitable_p resorts to a recursive way to
identify scalar use, for this case, setting STMT_VINFO_LIVE_P or not would
change scalar cost computation. If we can avoid such fake-liveness adjustment
on the statements we are interested in, vectorization cost could beat scalar
cost, and make the former succeed.

   Unvectorized: 
        mov     x2, x0
        stp     x29, x30, [sp, -48]!
        mov     x29, sp
        ldrb    w3, [x1]
        ldrb    w4, [x1, 1]
        add     x0, sp, 16
        ldrb    w9, [x2]
        ldrb    w8, [x2, 1]
        sub     w9, w9, w3
        ldrb    w7, [x2, 2]
        ldrb    w3, [x1, 2]
        sub     w8, w8, w4
        ldrb    w6, [x2, 3]
        ldrb    w4, [x1, 3]
        sub     w7, w7, w3
        ldrb    w10, [x1, 5]
        ldrb    w3, [x1, 4]
        sub     w6, w6, w4
        ldrb    w5, [x2, 4]
        ldrb    w4, [x2, 5]
        sub     w5, w5, w3
        ldrb    w3, [x2, 6]
        sub     w4, w4, w10
        ldrb    w2, [x2, 7]
        ldrb    w10, [x1, 6]
        ldrb    w1, [x1, 7]
        sub     w3, w3, w10
        stp     w9, w8, [sp, 16]
        sub     w1, w2, w1
        stp     w7, w6, [sp, 24]
        stp     w5, w4, [sp, 32]
        stp     w3, w1, [sp, 40]
        bl      test
        ldp     x29, x30, [sp], 48
        ret


    Vectorized:
        mov     x2, x0
        stp     x29, x30, [sp, -48]!
        mov     x29, sp
        ldr     d1, [x1]
        add     x0, sp, 16
        ldr     d0, [x2]
        usubl   v0.8h, v0.8b, v1.8b
        sxtl    v1.4s, v0.4h
        sxtl2   v0.4s, v0.8h
        stp     q1, q0, [sp, 16]
        bl      test
        ldp     x29, x30, [sp], 48
        ret


> these "overlapping" patterns, but IMHO having "two entries" into
> a chain of scalar stmts is bad and we should link up the whole matched
> sequence to the final "root" instead?
> 
> That said, the current code doesn't see that wherever we end up isn't
> dead code (aka fully covered by the vectorization).
> 
> IMO vect_stmt_to_vectorize for each of those stmts should end up at
> 
> patt_69 = (int) patt_68;

[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement

Reply via email to