https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114440

--- Comment #3 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Feng Xue <f...@gcc.gnu.org>:

https://gcc.gnu.org/g:db3c8c9726d0bafbb9f85b6d7027fe83602643e7

commit r15-2097-gdb3c8c9726d0bafbb9f85b6d7027fe83602643e7
Author: Feng Xue <f...@os.amperecomputing.com>
Date:   Wed May 29 17:28:14 2024 +0800

    vect: Optimize order of lane-reducing operations in loop def-use cycles

    When transforming multiple lane-reducing operations in a loop reduction
    chain, the corresponding vectorized statements were originally generated
    into def-use cycles starting from cycle 0. A def-use cycle with a smaller
    index would therefore contain more statements, which means a longer
    instruction dependency chain. For example:

       int sum = 1;
       for (i)
         {
           sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
           sum += w[i];               // widen-sum <vector(16) char>
           sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
           sum += n[i];               // normal <vector(4) int>
         }

    Original transformation result:

       for (i / 16)
         {
           sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy

           sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy

           sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
           sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy

           ...
         }

    For higher instruction parallelism in the final vectorized loop, a better
    approach is to distribute the effective vector lane-reducing ops evenly
    among all def-use cycles. In the transformation below, DOT_PROD, WIDEN_SUM
    and the SADs are generated into separate cycles, so the instruction
    dependencies among them are eliminated.

       for (i / 16)
         {
           sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy

           sum_v0 = sum_v0;  // copy
           sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy

           sum_v0 = sum_v0;  // copy
           sum_v1 = sum_v1;  // copy
           sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
           sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);

           ...
         }

    2024-03-22 Feng Xue <f...@os.amperecomputing.com>

    gcc/
            PR tree-optimization/114440
            * tree-vectorizer.h (struct _stmt_vec_info): Add a new field
            reduc_result_pos.
            * tree-vect-loop.cc (vect_transform_reduction): Generate
            lane-reducing statements in an optimized order.
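
The placement policy described in the commit message can be illustrated with a small stand-alone model. This is only a sketch, not the code in vect_transform_reduction: it assumes four def-use cycles (sum_v0 to sum_v3, as in the example) and assumes that each lane-reducing op starts at the cycle following the last one used by the previous op, wrapping around. The helper names place_op, max_chain and distribute are hypothetical; the ChangeLog names only the new field reduc_result_pos and does not show how it is updated.

```c
#define NCYCLES 4   /* def-use cycles sum_v0..sum_v3, as in the example */

/* Hypothetical helper: record one lane-reducing op that needs NCOPIES
   effective vector statements, starting at cycle *POS, then advance *POS
   past the cycles just used (wrapping around).  */
static void
place_op (int counts[NCYCLES], int *pos, int ncopies)
{
  for (int k = 0; k < ncopies; k++)
    counts[(*pos + k) % NCYCLES]++;
  *pos = (*pos + ncopies) % NCYCLES;
}

/* The longest dependency chain equals the largest number of effective
   statements that landed in any single cycle.  */
static int
max_chain (const int counts[NCYCLES])
{
  int m = 0;
  for (int c = 0; c < NCYCLES; c++)
    if (counts[c] > m)
      m = counts[c];
  return m;
}

/* Distribute NOPS ops whose effective copy counts are in NCOPIES.
   If ROTATE is zero, every op starts at cycle 0 (the original order);
   otherwise the start position rotates (the assumed optimized order).
   Fills COUNTS and returns the longest chain length.  */
static int
distribute (const int *ncopies, int nops, int rotate, int counts[NCYCLES])
{
  int pos = 0;
  for (int c = 0; c < NCYCLES; c++)
    counts[c] = 0;
  for (int i = 0; i < nops; i++)
    {
      place_op (counts, &pos, ncopies[i]);
      if (!rotate)
        pos = 0;   /* original behaviour: always restart at cycle 0 */
    }
  return max_chain (counts);
}
```

Calling distribute with the copy counts {1, 1, 2} (DOT_PROD, WIDEN_SUM, SAD) and rotate set to 0 puts three effective statements into cycle 0 and one into cycle 1, matching the original transformation result. With rotate set to 1, each of the four cycles receives exactly one effective statement, matching the optimized order.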
