unroll directly in vectorizer

Richard Biener Tue, 03 Jun 2025 00:32:33 -0700

> Am 03.06.2025 um 09:26 schrieb Richard Biener <rguent...@suse.de>:
> 
> 
> 
>> Am 02.06.2025 um 14:16 schrieb Tamar Christina <tamar.christ...@arm.com>:
>> 
>> 
>>> 
>>> -----Original Message-----
>>> From: Richard Biener <rguent...@suse.de>
>>> Sent: Monday, May 26, 2025 2:56 PM
>>> To: Tamar Christina <tamar.christ...@arm.com>
>>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
>>> Subject: RE: [PATCH 1/2]middle-end: Apply loop->unroll directly in 
>>> vectorizer
>>> 
>>>> On Mon, 19 May 2025, Tamar Christina wrote:
>>> 
>>>>>>  /* Complete the target-specific cost calculations.  */
>>>>>>  loop_vinfo->vector_costs->finish_cost (loop_vinfo->scalar_costs);
>>>>>>  vec_prologue_cost = loop_vinfo->vector_costs->prologue_cost ();
>>>>>> @@ -12373,6 +12394,13 @@ vect_transform_loop (loop_vec_info
>>> loop_vinfo,
>>>>> gimple *loop_vectorized_call)
>>>>>>   dump_printf_loc (MSG_NOTE, vect_location, "Disabling unrolling due to"
>>>>>>            " variable-length vectorization factor\n");
>>>>>>    }
>>>>>> +
>>>>>> +  /* When we have unrolled the loop due to a user requested value we
>>> should
>>>>>> +     leave it up to the RTL unroll heuristics to determine if it's 
>>>>>> still worth
>>>>>> +     while to unroll more.  */
>>>>>> +  if (LOOP_VINFO_USER_UNROLL (loop_vinfo))
>>>>> 
>>>>> What I meant with copying of LOOP_VINFO_USER_UNROLL is that I think
>>>>> you'll never get to this being true as you set the suggested unroll
>>>>> factor for the costing attempt of the not extra unrolled loop but
>>>>> the transform where you want to reset is is when the unrolling
>>>>> was actually applied?
>>>> 
>>>> It was being set on every analysis of the main loop body.  Since it wasn't
>>>> actually cleared until we've picked a mode and did codegen the condition 
>>>> would
>>>> be true.
>>>> 
>>>> However..
>>>> 
>>>>> 
>>>>> That said, it would be clearer if LOOP_VINFO_USER_UNROLL would be
>>>>> set in vect_analyze_loop_1 where we have
>>>>> 
>>>> 
>>>> I agree this is much nicer.
>>>> 
>>>> Bootstrapped Regtested on aarch64-none-linux-gnu,
>>>> arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
>>>> -m32, -m64 and no issues.
>>>> 
>>>> Ok for master?
>>>> 
>>>> Thanks,
>>>> Tamar
>>>> 
>>>> gcc/ChangeLog:
>>>> 
>>>>   * doc/extend.texi: Document pragma unroll interaction with vectorizer.
>>>>   * tree-vectorizer.h (LOOP_VINFO_USER_UNROLL): New.
>>>>   (class _loop_vec_info): Add user_unroll.
>>>>   * tree-vect-loop.cc (vect_analyze_loop_1 ): Set
>>>>   suggested_unroll_factor and retry.
>>>>   (_loop_vec_info::_loop_vec_info): Initialize user_unroll.
>>>>   (vect_transform_loop): Clear the loop->unroll value if the pragma was
>>>>   used.
>>>> 
>>>> gcc/testsuite/ChangeLog:
>>>> 
>>>>   * gcc.target/aarch64/unroll-vect.c: New test.
>>>> 
>>>> -- inline copy of patch --
>>>> 
>>>> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
>>>> index
>>> e87a3c271f8420d8fd175823b5bb655f76c89afe..f8261d13903afc90d3341c09a
>>> b3fdbd0ab96ea49 100644
>>>> --- a/gcc/doc/extend.texi
>>>> +++ b/gcc/doc/extend.texi
>>>> @@ -10398,6 +10398,11 @@ unrolled @var{n} times regardless of any
>>> commandline arguments.
>>>> When the option is @var{preferred} then the user is allowed to override the
>>>> unroll amount through commandline options.
>>>> 
>>>> +If the loop was vectorized the unroll factor specified will be used to 
>>>> seed the
>>>> +vectorizer unroll factor.  Whether the loop is unrolled or not will be
>>>> +determined by target costing.  The resulting vectorized loop may still be
>>>> +unrolled more in later passes depending on the target costing.
>>>> +
>>>> @end table
>>>> 
>>>> @node Thread-Local
>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/unroll-vect.c
>>> b/gcc/testsuite/gcc.target/aarch64/unroll-vect.c
>>>> new file mode 100644
>>>> index
>>> 0000000000000000000000000000000000000000..3cb774ba95787ebee488fb
>>> e7306299ef28e6bb35
>>>> --- /dev/null
>>>> +++ b/gcc/testsuite/gcc.target/aarch64/unroll-vect.c
>>>> @@ -0,0 +1,20 @@
>>>> +/* { dg-do compile } */
>>>> +/* { dg-additional-options "-O3 -march=armv8-a --param aarch64-autovec-
>>> preference=asimd-only -std=gnu99" } */
>>>> +/* { dg-final { check-function-bodies "**" "" "" } } */
>>>> +
>>>> +/*
>>>> +** f1:
>>>> +**    ...
>>>> +**    add    v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
>>>> +**    add    v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
>>>> +**    add    v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
>>>> +**    add    v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
>>>> +**    ...
>>>> +*/
>>>> +void f1 (int *restrict a, int n)
>>>> +{
>>>> +#pragma GCC unroll 16
>>>> +  for (int i = 0; i < n; i++)
>>>> +    a[i] *= 2;
>>>> +}
>>>> +
>>>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
>>>> index
>>> fe6f3cf188e40396b299ff9e814cc402bc2d4e2d..f215b6bc7881e7e659272cefbe
>>> 3d5c8892ef768c 100644
>>>> --- a/gcc/tree-vect-loop.cc
>>>> +++ b/gcc/tree-vect-loop.cc
>>>> @@ -1073,6 +1073,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in,
>>> vec_info_shared *shared)
>>>>    peeling_for_gaps (false),
>>>>    peeling_for_niter (false),
>>>>    early_breaks (false),
>>>> +    user_unroll (false),
>>>>    no_data_dependencies (false),
>>>>    has_mask_store (false),
>>>>    scalar_loop_scaling (profile_probability::uninitialized ()),
>>>> @@ -3428,27 +3429,51 @@ vect_analyze_loop_1 (class loop *loop,
>>> vec_info_shared *shared,
>>>>            res ? "succeeded" : "failed",
>>>>            GET_MODE_NAME (loop_vinfo->vector_mode));
>>>> 
>>>> -  if (res && !LOOP_VINFO_EPILOGUE_P (loop_vinfo) &&
>>> suggested_unroll_factor > 1)
>>>> +  auto user_unroll = LOOP_VINFO_LOOP (loop_vinfo)->unroll;
>>>> +  if (res && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>>>> +      /* Check to see if the user wants to unroll or if the target wants 
>>>> to.  */
>>>> +      && (suggested_unroll_factor > 1 || user_unroll > 1))
>>>>    {
>>>> -      if (dump_enabled_p ())
>>>> -    dump_printf_loc (MSG_NOTE, vect_location,
>>>> +      if (suggested_unroll_factor == 1)
>>>> +    {
>>>> +      int assumed_vf = vect_vf_for_cost (loop_vinfo);
>>>> +      int unroll_fact = user_unroll / assumed_vf;
>>>> +      suggested_unroll_factor = 1 << ceil_log2 (unroll_fact);
>>> 
>>> So with SLP we can have non-power-of-two vectorization factors, and
>>> certainly the extra unrolling we apply ontop of determining the VF
>>> has no additional restrictions.  So I woder why you go through
>>> 1 << ceil_log2 (unroll_fact) here instead of just using
>>> user_unroll / assumed_vf?  IMO a truncating division is OK, we could
>>> argue for rounding?  If we apply costing across modes there might
>>> be a mode with the unrolling exactly matching - I'm unsure if we
>>> want to adhere to #pragma GCC unroll irrespective of costing or not.
>>> 
>>> That said, as we simply do
>>> 
>>> if (applying_suggested_uf)
>>>   LOOP_VINFO_VECT_FACTOR (loop_vinfo) *= loop_vinfo-
>>>> suggested_unroll_factor;
>>> 
>>> I'd say go with
>>> 
>>>  suggested_unroll_factor = user_unroll / assumed_vf;
>>> 
>>> ?
>>> 
>> 
>> The reason for it was that there's an assumption in place that the final VF
>> is a power of two.  Which is enforced by
>> 
>> if (!is_gimple_val (niters_vector))
>>   {
>>     var = create_tmp_var (type, "bnd");
>>     gimple_seq stmts = NULL;
>>     niters_vector = force_gimple_operand (niters_vector, &stmts, true, var);
>>     gsi_insert_seq_on_edge_immediate (pe, stmts);
>>     /* Peeling algorithm guarantees that vector loop bound is at least ONE,
>>    we set range information to make niters analyzer's life easier.
>>    Note the number of latch iteration value can be TYPE_MAX_VALUE so
>>    we have to represent the vector niter TYPE_MAX_VALUE + 1 >> log_vf.  */
>>     if (stmts != NULL && log_vf)
> 
> That’s just executed whe unrolling is enforced?  The VF can easily be 
> non-power of two, even odd, when SLP is in effect.
> 
>> where log_vf is using exact_log2 which doesn't return null on non-power-of-2 
>> values,
>> it returns -1;
>> 
>> So the check is somewhat incorrect.
> 
> Wrong I’d say.
> 
>> We'd need
>> 
>>     if (stmts != NULL && log_vf && ! integer_minus_onep (log_vf))
>>   {
>> 
>> But I didn't do so because we would then not set range information for 
>> niters_vector.
> 
> But does the code rely on logVF for the range?  Why would it only work for 
> that?
> Peeling ensures niter/VF >= 1 even for VF == 3

Ok, I think I’m wrong here.  The only odd VF we’d actually chose is 1.  That 
said, we shouldn’t limit ourselves here and fixup places with this assumption.

Richard 

>> So instead I just made sure that the unfoll factor is a power of two, such 
>> that the multiplication
>> returns a power of two.
>> 
>> Would you like the extra !integer_minus_onep (log_vf)) instead? It does seem 
>> that many things
>> would be less optimal. E.g. in
>> 
>>     /* Create: niters >> log2(vf) */
>>     /* If it's known that niters == number of latch executions + 1 doesn't
>>    overflow, we can generate niters >> log2(vf); otherwise we generate
>>    (niters - vf) >> log2(vf) + 1 by using the fact that we know ratio
>>    will be at least one.  */
>>     log_vf = build_int_cst (type, exact_log2 (const_vf));
>>     if (niters_no_overflow)
>>   niters_vector = fold_build2 (RSHIFT_EXPR, type, ni_minus_gap, log_vf);
>> 
>> even for a small non-power of two unroll factor like, 3 would then become 
>> -1, so niters_vector suddenly
>> becomes unknown.
> 
> No, but we need to use a division instead of a shift.
> 
> Richard
> 
>> 
>> Thanks,
>> Tamar
>> 
>>> The rest of the patch looks OK to me.
>>> 
>>> Richard.
>>> 
>>> 
>>>> +      if (suggested_unroll_factor > 1)
>>>> +        {
>>>> +          if (dump_enabled_p ())
>>>> +        dump_printf_loc (MSG_NOTE, vect_location,
>>>> +             "setting unroll factor to %d based on user requested "
>>>> +             "unroll factor %d and suggested vectorization "
>>>> +             "factor: %d\n",
>>>> +             suggested_unroll_factor, user_unroll, assumed_vf);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (suggested_unroll_factor > 1)
>>>> +      {
>>>> +        if (dump_enabled_p ())
>>>> +          dump_printf_loc (MSG_NOTE, vect_location,
>>>>            "***** Re-trying analysis for unrolling"
>>>>            " with unroll factor %d and slp %s.\n",
>>>>            suggested_unroll_factor,
>>>>            slp_done_for_suggested_uf ? "on" : "off");
>>>> -      loop_vec_info unroll_vinfo
>>>> -    = vect_create_loop_vinfo (loop, shared, loop_form_info, NULL);
>>>> -      unroll_vinfo->vector_mode = vector_mode;
>>>> -      unroll_vinfo->suggested_unroll_factor = suggested_unroll_factor;
>>>> -      opt_result new_res = vect_analyze_loop_2 (unroll_vinfo, fatal, NULL,
>>>> -                        slp_done_for_suggested_uf);
>>>> -      if (new_res)
>>>> -    {
>>>> -      delete loop_vinfo;
>>>> -      loop_vinfo = unroll_vinfo;
>>>> -    }
>>>> -      else
>>>> -    delete unroll_vinfo;
>>>> +        loop_vec_info unroll_vinfo
>>>> +        = vect_create_loop_vinfo (loop, shared, loop_form_info, NULL);
>>>> +        unroll_vinfo->vector_mode = vector_mode;
>>>> +        unroll_vinfo->suggested_unroll_factor = suggested_unroll_factor;
>>>> +        opt_result new_res
>>>> +        = vect_analyze_loop_2 (unroll_vinfo, fatal, NULL,
>>>> +                       slp_done_for_suggested_uf);
>>>> +        if (new_res)
>>>> +          {
>>>> +        delete loop_vinfo;
>>>> +        loop_vinfo = unroll_vinfo;
>>>> +        LOOP_VINFO_USER_UNROLL (loop_vinfo) = user_unroll > 1;
>>>> +          }
>>>> +        else
>>>> +          delete unroll_vinfo;
>>>> +      }
>>>>    }
>>>> 
>>>>  /* Remember the autodetected vector mode.  */
>>>> @@ -12373,6 +12398,13 @@ vect_transform_loop (loop_vec_info loop_vinfo,
>>> gimple *loop_vectorized_call)
>>>>   dump_printf_loc (MSG_NOTE, vect_location, "Disabling unrolling due to"
>>>>            " variable-length vectorization factor\n");
>>>>    }
>>>> +
>>>> +  /* When we have unrolled the loop due to a user requested value we 
>>>> should
>>>> +     leave it up to the RTL unroll heuristics to determine if it's still 
>>>> worth
>>>> +     while to unroll more.  */
>>>> +  if (LOOP_VINFO_USER_UNROLL (loop_vinfo))
>>>> +    loop->unroll = 0;
>>>> +
>>>>  /* Free SLP instances here because otherwise stmt reference counting
>>>>     won't work.  */
>>>>  slp_instance instance;
>>>> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>>>> index
>>> a2f33a5ecd60288fe7f28ee639ff8b6a77667796..8fd8c10ec64f7241d6b097491f
>>> 84400164893911 100644
>>>> --- a/gcc/tree-vectorizer.h
>>>> +++ b/gcc/tree-vectorizer.h
>>>> @@ -970,6 +970,10 @@ public:
>>>>  /* Main loop IV cond.  */
>>>>  gcond* loop_iv_cond;
>>>> 
>>>> +  /* True if we have an unroll factor requested by the user through 
>>>> pragma GCC
>>>> +     unroll.  */
>>>> +  bool user_unroll;
>>>> +
>>>>  /* True if there are no loop carried data dependencies in the loop.
>>>>     If loop->safelen <= 1, then this is always true, either the loop
>>>>     didn't have any loop carried data dependencies, or the loop is being
>>>> @@ -1094,6 +1098,7 @@ public:
>>>> #define LOOP_VINFO_CHECK_UNEQUAL_ADDRS(L)  (L)->check_unequal_addrs
>>>> #define LOOP_VINFO_CHECK_NONZERO(L)        (L)->check_nonzero
>>>> #define LOOP_VINFO_LOWER_BOUNDS(L)         (L)->lower_bounds
>>>> +#define LOOP_VINFO_USER_UNROLL(L)          (L)->user_unroll
>>>> #define LOOP_VINFO_GROUPED_STORES(L)       (L)->grouped_stores
>>>> #define LOOP_VINFO_SLP_INSTANCES(L)        (L)->slp_instances
>>>> #define LOOP_VINFO_SLP_UNROLLING_FACTOR(L) (L)->slp_unrolling_factor
>>>> 
>>> 
>>> --
>>> Richard Biener <rguent...@suse.de>
>>> SUSE Software Solutions Germany GmbH,
>>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
Re: [PATCH 1/2]middle-end: Apply loop->unroll directly in vectorizer

Reply via email to