> Am 02.06.2025 um 14:16 schrieb Tamar Christina <tamar.christ...@arm.com>:
>
>
>>
>> -----Original Message-----
>> From: Richard Biener <rguent...@suse.de>
>> Sent: Monday, May 26, 2025 2:56 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
>> Subject: RE: [PATCH 1/2]middle-end: Apply loop->unroll directly in vectorizer
>>
>> On Mon, 19 May 2025, Tamar Christina wrote:
>>
>>>>> /* Complete the target-specific cost calculations. */
>>>>> loop_vinfo->vector_costs->finish_cost (loop_vinfo->scalar_costs);
>>>>> vec_prologue_cost = loop_vinfo->vector_costs->prologue_cost ();
>>>>> @@ -12373,6 +12394,13 @@ vect_transform_loop (loop_vec_info
>> loop_vinfo,
>>>> gimple *loop_vectorized_call)
>>>>> dump_printf_loc (MSG_NOTE, vect_location, "Disabling unrolling due to"
>>>>> " variable-length vectorization factor\n");
>>>>> }
>>>>> +
>>>>> + /* When we have unrolled the loop due to a user requested value we
>> should
>>>>> + leave it up to the RTL unroll heuristics to determine if it's still
>>>>> worth
>>>>> + while to unroll more. */
>>>>> + if (LOOP_VINFO_USER_UNROLL (loop_vinfo))
>>>>
>>>> What I meant with copying of LOOP_VINFO_USER_UNROLL is that I think
>>>> you'll never get to this being true as you set the suggested unroll
>>>> factor for the costing attempt of the not extra unrolled loop but
>>>> the transform where you want to reset is is when the unrolling
>>>> was actually applied?
>>>
>>> It was being set on every analysis of the main loop body. Since it wasn't
>>> actually cleared until we've picked a mode and did codegen the condition
>>> would
>>> be true.
>>>
>>> However..
>>>
>>>>
>>>> That said, it would be clearer if LOOP_VINFO_USER_UNROLL would be
>>>> set in vect_analyze_loop_1 where we have
>>>>
>>>
>>> I agree this is much nicer.
>>>
>>> Bootstrapped Regtested on aarch64-none-linux-gnu,
>>> arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
>>> -m32, -m64 and no issues.
>>>
>>> Ok for master?
>>>
>>> Thanks,
>>> Tamar
>>>
>>> gcc/ChangeLog:
>>>
>>> * doc/extend.texi: Document pragma unroll interaction with vectorizer.
>>> * tree-vectorizer.h (LOOP_VINFO_USER_UNROLL): New.
>>> (class _loop_vec_info): Add user_unroll.
>>> * tree-vect-loop.cc (vect_analyze_loop_1 ): Set
>>> suggested_unroll_factor and retry.
>>> (_loop_vec_info::_loop_vec_info): Initialize user_unroll.
>>> (vect_transform_loop): Clear the loop->unroll value if the pragma was
>>> used.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>> * gcc.target/aarch64/unroll-vect.c: New test.
>>>
>>> -- inline copy of patch --
>>>
>>> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
>>> index
>> e87a3c271f8420d8fd175823b5bb655f76c89afe..f8261d13903afc90d3341c09a
>> b3fdbd0ab96ea49 100644
>>> --- a/gcc/doc/extend.texi
>>> +++ b/gcc/doc/extend.texi
>>> @@ -10398,6 +10398,11 @@ unrolled @var{n} times regardless of any
>> commandline arguments.
>>> When the option is @var{preferred} then the user is allowed to override the
>>> unroll amount through commandline options.
>>>
>>> +If the loop was vectorized the unroll factor specified will be used to
>>> seed the
>>> +vectorizer unroll factor. Whether the loop is unrolled or not will be
>>> +determined by target costing. The resulting vectorized loop may still be
>>> +unrolled more in later passes depending on the target costing.
>>> +
>>> @end table
>>>
>>> @node Thread-Local
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/unroll-vect.c
>> b/gcc/testsuite/gcc.target/aarch64/unroll-vect.c
>>> new file mode 100644
>>> index
>> 0000000000000000000000000000000000000000..3cb774ba95787ebee488fb
>> e7306299ef28e6bb35
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/aarch64/unroll-vect.c
>>> @@ -0,0 +1,20 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-additional-options "-O3 -march=armv8-a --param aarch64-autovec-
>> preference=asimd-only -std=gnu99" } */
>>> +/* { dg-final { check-function-bodies "**" "" "" } } */
>>> +
>>> +/*
>>> +** f1:
>>> +** ...
>>> +** add v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
>>> +** add v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
>>> +** add v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
>>> +** add v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
>>> +** ...
>>> +*/
>>> +void f1 (int *restrict a, int n)
>>> +{
>>> +#pragma GCC unroll 16
>>> + for (int i = 0; i < n; i++)
>>> + a[i] *= 2;
>>> +}
>>> +
>>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
>>> index
>> fe6f3cf188e40396b299ff9e814cc402bc2d4e2d..f215b6bc7881e7e659272cefbe
>> 3d5c8892ef768c 100644
>>> --- a/gcc/tree-vect-loop.cc
>>> +++ b/gcc/tree-vect-loop.cc
>>> @@ -1073,6 +1073,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in,
>> vec_info_shared *shared)
>>> peeling_for_gaps (false),
>>> peeling_for_niter (false),
>>> early_breaks (false),
>>> + user_unroll (false),
>>> no_data_dependencies (false),
>>> has_mask_store (false),
>>> scalar_loop_scaling (profile_probability::uninitialized ()),
>>> @@ -3428,27 +3429,51 @@ vect_analyze_loop_1 (class loop *loop,
>> vec_info_shared *shared,
>>> res ? "succeeded" : "failed",
>>> GET_MODE_NAME (loop_vinfo->vector_mode));
>>>
>>> - if (res && !LOOP_VINFO_EPILOGUE_P (loop_vinfo) &&
>> suggested_unroll_factor > 1)
>>> + auto user_unroll = LOOP_VINFO_LOOP (loop_vinfo)->unroll;
>>> + if (res && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>>> + /* Check to see if the user wants to unroll or if the target wants
>>> to. */
>>> + && (suggested_unroll_factor > 1 || user_unroll > 1))
>>> {
>>> - if (dump_enabled_p ())
>>> - dump_printf_loc (MSG_NOTE, vect_location,
>>> + if (suggested_unroll_factor == 1)
>>> + {
>>> + int assumed_vf = vect_vf_for_cost (loop_vinfo);
>>> + int unroll_fact = user_unroll / assumed_vf;
>>> + suggested_unroll_factor = 1 << ceil_log2 (unroll_fact);
>>
>> So with SLP we can have non-power-of-two vectorization factors, and
>> certainly the extra unrolling we apply ontop of determining the VF
>> has no additional restrictions. So I woder why you go through
>> 1 << ceil_log2 (unroll_fact) here instead of just using
>> user_unroll / assumed_vf? IMO a truncating division is OK, we could
>> argue for rounding? If we apply costing across modes there might
>> be a mode with the unrolling exactly matching - I'm unsure if we
>> want to adhere to #pragma GCC unroll irrespective of costing or not.
>>
>> That said, as we simply do
>>
>> if (applying_suggested_uf)
>> LOOP_VINFO_VECT_FACTOR (loop_vinfo) *= loop_vinfo-
>>> suggested_unroll_factor;
>>
>> I'd say go with
>>
>> suggested_unroll_factor = user_unroll / assumed_vf;
>>
>> ?
>>
>
> The reason for it was that there's an assumption in place that the final VF
> is a power of two. Which is enforced by
>
> if (!is_gimple_val (niters_vector))
> {
> var = create_tmp_var (type, "bnd");
> gimple_seq stmts = NULL;
> niters_vector = force_gimple_operand (niters_vector, &stmts, true, var);
> gsi_insert_seq_on_edge_immediate (pe, stmts);
> /* Peeling algorithm guarantees that vector loop bound is at least ONE,
> we set range information to make niters analyzer's life easier.
> Note the number of latch iteration value can be TYPE_MAX_VALUE so
> we have to represent the vector niter TYPE_MAX_VALUE + 1 >> log_vf. */
> if (stmts != NULL && log_vf)
That’s just executed whe unrolling is enforced? The VF can easily be non-power
of two, even odd, when SLP is in effect.
> where log_vf is using exact_log2 which doesn't return null on non-power-of-2
> values,
> it returns -1;
>
> So the check is somewhat incorrect.
Wrong I’d say.
> We'd need
>
> if (stmts != NULL && log_vf && ! integer_minus_onep (log_vf))
> {
>
> But I didn't do so because we would then not set range information for
> niters_vector.
But does the code rely on logVF for the range? Why would it only work for that?
Peeling ensures niter/VF >= 1 even for VF == 3
> So instead I just made sure that the unfoll factor is a power of two, such
> that the multiplication
> returns a power of two.
>
> Would you like the extra !integer_minus_onep (log_vf)) instead? It does seem
> that many things
> would be less optimal. E.g. in
>
> /* Create: niters >> log2(vf) */
> /* If it's known that niters == number of latch executions + 1 doesn't
> overflow, we can generate niters >> log2(vf); otherwise we generate
> (niters - vf) >> log2(vf) + 1 by using the fact that we know ratio
> will be at least one. */
> log_vf = build_int_cst (type, exact_log2 (const_vf));
> if (niters_no_overflow)
> niters_vector = fold_build2 (RSHIFT_EXPR, type, ni_minus_gap, log_vf);
>
> even for a small non-power of two unroll factor like, 3 would then become -1,
> so niters_vector suddenly
> becomes unknown.
No, but we need to use a division instead of a shift.
Richard
>
> Thanks,
> Tamar
>
>> The rest of the patch looks OK to me.
>>
>> Richard.
>>
>>
>>> + if (suggested_unroll_factor > 1)
>>> + {
>>> + if (dump_enabled_p ())
>>> + dump_printf_loc (MSG_NOTE, vect_location,
>>> + "setting unroll factor to %d based on user requested "
>>> + "unroll factor %d and suggested vectorization "
>>> + "factor: %d\n",
>>> + suggested_unroll_factor, user_unroll, assumed_vf);
>>> + }
>>> + }
>>> +
>>> + if (suggested_unroll_factor > 1)
>>> + {
>>> + if (dump_enabled_p ())
>>> + dump_printf_loc (MSG_NOTE, vect_location,
>>> "***** Re-trying analysis for unrolling"
>>> " with unroll factor %d and slp %s.\n",
>>> suggested_unroll_factor,
>>> slp_done_for_suggested_uf ? "on" : "off");
>>> - loop_vec_info unroll_vinfo
>>> - = vect_create_loop_vinfo (loop, shared, loop_form_info, NULL);
>>> - unroll_vinfo->vector_mode = vector_mode;
>>> - unroll_vinfo->suggested_unroll_factor = suggested_unroll_factor;
>>> - opt_result new_res = vect_analyze_loop_2 (unroll_vinfo, fatal, NULL,
>>> - slp_done_for_suggested_uf);
>>> - if (new_res)
>>> - {
>>> - delete loop_vinfo;
>>> - loop_vinfo = unroll_vinfo;
>>> - }
>>> - else
>>> - delete unroll_vinfo;
>>> + loop_vec_info unroll_vinfo
>>> + = vect_create_loop_vinfo (loop, shared, loop_form_info, NULL);
>>> + unroll_vinfo->vector_mode = vector_mode;
>>> + unroll_vinfo->suggested_unroll_factor = suggested_unroll_factor;
>>> + opt_result new_res
>>> + = vect_analyze_loop_2 (unroll_vinfo, fatal, NULL,
>>> + slp_done_for_suggested_uf);
>>> + if (new_res)
>>> + {
>>> + delete loop_vinfo;
>>> + loop_vinfo = unroll_vinfo;
>>> + LOOP_VINFO_USER_UNROLL (loop_vinfo) = user_unroll > 1;
>>> + }
>>> + else
>>> + delete unroll_vinfo;
>>> + }
>>> }
>>>
>>> /* Remember the autodetected vector mode. */
>>> @@ -12373,6 +12398,13 @@ vect_transform_loop (loop_vec_info loop_vinfo,
>> gimple *loop_vectorized_call)
>>> dump_printf_loc (MSG_NOTE, vect_location, "Disabling unrolling due to"
>>> " variable-length vectorization factor\n");
>>> }
>>> +
>>> + /* When we have unrolled the loop due to a user requested value we should
>>> + leave it up to the RTL unroll heuristics to determine if it's still
>>> worth
>>> + while to unroll more. */
>>> + if (LOOP_VINFO_USER_UNROLL (loop_vinfo))
>>> + loop->unroll = 0;
>>> +
>>> /* Free SLP instances here because otherwise stmt reference counting
>>> won't work. */
>>> slp_instance instance;
>>> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>>> index
>> a2f33a5ecd60288fe7f28ee639ff8b6a77667796..8fd8c10ec64f7241d6b097491f
>> 84400164893911 100644
>>> --- a/gcc/tree-vectorizer.h
>>> +++ b/gcc/tree-vectorizer.h
>>> @@ -970,6 +970,10 @@ public:
>>> /* Main loop IV cond. */
>>> gcond* loop_iv_cond;
>>>
>>> + /* True if we have an unroll factor requested by the user through pragma
>>> GCC
>>> + unroll. */
>>> + bool user_unroll;
>>> +
>>> /* True if there are no loop carried data dependencies in the loop.
>>> If loop->safelen <= 1, then this is always true, either the loop
>>> didn't have any loop carried data dependencies, or the loop is being
>>> @@ -1094,6 +1098,7 @@ public:
>>> #define LOOP_VINFO_CHECK_UNEQUAL_ADDRS(L) (L)->check_unequal_addrs
>>> #define LOOP_VINFO_CHECK_NONZERO(L) (L)->check_nonzero
>>> #define LOOP_VINFO_LOWER_BOUNDS(L) (L)->lower_bounds
>>> +#define LOOP_VINFO_USER_UNROLL(L) (L)->user_unroll
>>> #define LOOP_VINFO_GROUPED_STORES(L) (L)->grouped_stores
>>> #define LOOP_VINFO_SLP_INSTANCES(L) (L)->slp_instances
>>> #define LOOP_VINFO_SLP_UNROLLING_FACTOR(L) (L)->slp_unrolling_factor
>>>
>>
>> --
>> Richard Biener <rguent...@suse.de>
>> SUSE Software Solutions Germany GmbH,
>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)