Richard Biener <richard.guent...@gmail.com> writes:
>> Am 06.07.2023 um 19:50 schrieb Richard Sandiford <richard.sandif...@arm.com>:
>> 
>> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>>>> On Wed, Jul 5, 2023 at 8:44 AM Hao Liu OS via Gcc-patches
>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> If a loop is unrolled by n times during vectoriation, two steps are used to
>>>> calculate the induction variable:
>>>>  - The small step for the unrolled ith-copy: vec_1 = vec_iv + (VF/n * Step)
>>>>  - The large step for the whole loop: vec_loop = vec_iv + (VF * Step)
>>>> 
>>>> This patch calculates an extra vec_n to replace vec_loop:
>>>>  vec_n = vec_prev + (VF/n * S) = vec_iv + (VF/n * S) * n = vec_loop.
>>>> 
>>>> So that we can save the large step register and related operations.
>>> 
>>> OK.  It would be nice to avoid the dead stmts created earlier though.
>> 
>> FWIW, I still don't think we should do this.  Part of the point of
>> unrolling is to shorten loop-carried dependencies, whereas this patch
>> is going in the opposite direction.
>
> Note ncopies can be >1 without additional unrolling.

Yeah, true.  But I think even there, avoiding a longer loop-carried
dependency should be a good thing.

> With non VLA vectors all of the updates will be constant folded btw.

Are you sure?  The motivating example is an Advanced SIMD one,
not a VLA one.  No variable-length vectors are involved.

Maybe constant folding caps the dependency chain to length 2?
But 2 is still more than 1. :)

Thanks,
Richard

>
> Richard 
>
>> Richard
>> 
>>> 
>>> Thanks,
>>> Richard.
>>> 
>>>> gcc/ChangeLog:
>>>> 
>>>>        PR tree-optimization/110449
>>>>        * tree-vect-loop.cc (vectorizable_induction): use vec_n to replace
>>>>        vec_loop for the unrolled loop.
>>>> 
>>>> gcc/testsuite/ChangeLog:
>>>> 
>>>>        * gcc.target/aarch64/pr110449.c: New testcase.
>>>> ---
>>>> gcc/testsuite/gcc.target/aarch64/pr110449.c | 40 +++++++++++++++++++++
>>>> gcc/tree-vect-loop.cc                       | 21 +++++++++--
>>>> 2 files changed, 58 insertions(+), 3 deletions(-)
>>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110449.c
>>>> 
>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr110449.c 
>>>> b/gcc/testsuite/gcc.target/aarch64/pr110449.c
>>>> new file mode 100644
>>>> index 00000000000..bb3b6dcfe08
>>>> --- /dev/null
>>>> +++ b/gcc/testsuite/gcc.target/aarch64/pr110449.c
>>>> @@ -0,0 +1,40 @@
>>>> +/* { dg-do compile } */
>>>> +/* { dg-options "-Ofast -mcpu=neoverse-n2 --param 
>>>> aarch64-vect-unroll-limit=2" } */
>>>> +/* { dg-final { scan-assembler-not "8.0e\\+0" } } */
>>>> +
>>>> +/* Calcualte the vectorized induction with smaller step for an unrolled 
>>>> loop.
>>>> +
>>>> +   before (suggested_unroll_factor=2):
>>>> +         fmov    s30, 8.0e+0
>>>> +         fmov    s31, 4.0e+0
>>>> +         dup     v27.4s, v30.s[0]
>>>> +         dup     v28.4s, v31.s[0]
>>>> +     .L6:
>>>> +         mov     v30.16b, v31.16b
>>>> +         fadd    v31.4s, v31.4s, v27.4s
>>>> +         fadd    v29.4s, v30.4s, v28.4s
>>>> +         stp     q30, q29, [x0]
>>>> +         add     x0, x0, 32
>>>> +         cmp     x1, x0
>>>> +         bne     .L6
>>>> +
>>>> +   after:
>>>> +         fmov    s31, 4.0e+0
>>>> +         dup     v29.4s, v31.s[0]
>>>> +     .L6:
>>>> +         fadd    v30.4s, v31.4s, v29.4s
>>>> +         stp     q31, q30, [x0]
>>>> +         add     x0, x0, 32
>>>> +         fadd    v31.4s, v29.4s, v30.4s
>>>> +         cmp     x0, x1
>>>> +         bne     .L6  */
>>>> +
>>>> +void
>>>> +foo2 (float *arr, float freq, float step)
>>>> +{
>>>> +  for (int i = 0; i < 1024; i++)
>>>> +    {
>>>> +      arr[i] = freq;
>>>> +      freq += step;
>>>> +    }
>>>> +}
>>>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
>>>> index 3b46c58a8d8..706ecbffd0c 100644
>>>> --- a/gcc/tree-vect-loop.cc
>>>> +++ b/gcc/tree-vect-loop.cc
>>>> @@ -10114,7 +10114,7 @@ vectorizable_induction (loop_vec_info loop_vinfo,
>>>>                                   new_vec, step_vectype, NULL);
>>>> 
>>>>       vec_def = induc_def;
>>>> -      for (i = 1; i < ncopies; i++)
>>>> +      for (i = 1; i < ncopies + 1; i++)
>>>>        {
>>>>          /* vec_i = vec_prev + vec_step  */
>>>>          gimple_seq stmts = NULL;
>>>> @@ -10124,8 +10124,23 @@ vectorizable_induction (loop_vec_info loop_vinfo,
>>>>          vec_def = gimple_convert (&stmts, vectype, vec_def);
>>>> 
>>>>          gsi_insert_seq_before (&si, stmts, GSI_SAME_STMT);
>>>> -         new_stmt = SSA_NAME_DEF_STMT (vec_def);
>>>> -         STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>>>> +         if (i < ncopies)
>>>> +           {
>>>> +             new_stmt = SSA_NAME_DEF_STMT (vec_def);
>>>> +             STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>>>> +           }
>>>> +         else
>>>> +           {
>>>> +             /* vec_1 = vec_iv + (VF/n * S)
>>>> +                vec_2 = vec_1 + (VF/n * S)
>>>> +                ...
>>>> +                vec_n = vec_prev + (VF/n * S) = vec_iv + VF * S = vec_loop
>>>> +
>>>> +                vec_n is used as vec_loop to save the large step register 
>>>> and
>>>> +                related operations.  */
>>>> +             add_phi_arg (induction_phi, vec_def, loop_latch_edge 
>>>> (iv_loop),
>>>> +                          UNKNOWN_LOCATION);
>>>> +           }
>>>>        }
>>>>     }
>>>> 
>>>> --
>>>> 2.34.1

Reply via email to