> Am 03.06.2025 um 09:26 schrieb Richard Biener <rguent...@suse.de>:
>
>
>
>> Am 02.06.2025 um 14:16 schrieb Tamar Christina <tamar.christ...@arm.com>:
>>
>>
>>>
>>> -----Original Message-----
>>> From: Richard Biener <rguent...@suse.de>
>>> Sent: Monday, May 26, 2025 2:56 PM
>>> To: Tamar Christina <tamar.christ...@arm.com>
>>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
>>> Subject: RE: [PATCH 1/2]middle-end: Apply loop->unroll directly in
>>> vectorizer
>>>
>>>> On Mon, 19 May 2025, Tamar Christina wrote:
>>>
>>>>>> /* Complete the target-specific cost calculations. */
>>>>>> loop_vinfo->vector_costs->finish_cost (loop_vinfo->scalar_costs);
>>>>>> vec_prologue_cost = loop_vinfo->vector_costs->prologue_cost ();
>>>>>> @@ -12373,6 +12394,13 @@ vect_transform_loop (loop_vec_info
>>> loop_vinfo,
>>>>> gimple *loop_vectorized_call)
>>>>>> dump_printf_loc (MSG_NOTE, vect_location, "Disabling unrolling due to"
>>>>>> " variable-length vectorization factor\n");
>>>>>> }
>>>>>> +
>>>>>> + /* When we have unrolled the loop due to a user requested value we
>>> should
>>>>>> + leave it up to the RTL unroll heuristics to determine if it's
>>>>>> still worth
>>>>>> + while to unroll more. */
>>>>>> + if (LOOP_VINFO_USER_UNROLL (loop_vinfo))
>>>>>
>>>>> What I meant with copying of LOOP_VINFO_USER_UNROLL is that I think
>>>>> you'll never get to this being true as you set the suggested unroll
>>>>> factor for the costing attempt of the not extra unrolled loop but
>>>>> the transform where you want to reset is is when the unrolling
>>>>> was actually applied?
>>>>
>>>> It was being set on every analysis of the main loop body. Since it wasn't
>>>> actually cleared until we've picked a mode and did codegen the condition
>>>> would
>>>> be true.
>>>>
>>>> However..
>>>>
>>>>>
>>>>> That said, it would be clearer if LOOP_VINFO_USER_UNROLL would be
>>>>> set in vect_analyze_loop_1 where we have
>>>>>
>>>>
>>>> I agree this is much nicer.
>>>>
>>>> Bootstrapped Regtested on aarch64-none-linux-gnu,
>>>> arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
>>>> -m32, -m64 and no issues.
>>>>
>>>> Ok for master?
>>>>
>>>> Thanks,
>>>> Tamar
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>> * doc/extend.texi: Document pragma unroll interaction with vectorizer.
>>>> * tree-vectorizer.h (LOOP_VINFO_USER_UNROLL): New.
>>>> (class _loop_vec_info): Add user_unroll.
>>>> * tree-vect-loop.cc (vect_analyze_loop_1 ): Set
>>>> suggested_unroll_factor and retry.
>>>> (_loop_vec_info::_loop_vec_info): Initialize user_unroll.
>>>> (vect_transform_loop): Clear the loop->unroll value if the pragma was
>>>> used.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>> * gcc.target/aarch64/unroll-vect.c: New test.
>>>>
>>>> -- inline copy of patch --
>>>>
>>>> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
>>>> index
>>> e87a3c271f8420d8fd175823b5bb655f76c89afe..f8261d13903afc90d3341c09a
>>> b3fdbd0ab96ea49 100644
>>>> --- a/gcc/doc/extend.texi
>>>> +++ b/gcc/doc/extend.texi
>>>> @@ -10398,6 +10398,11 @@ unrolled @var{n} times regardless of any
>>> commandline arguments.
>>>> When the option is @var{preferred} then the user is allowed to override the
>>>> unroll amount through commandline options.
>>>>
>>>> +If the loop was vectorized the unroll factor specified will be used to
>>>> seed the
>>>> +vectorizer unroll factor. Whether the loop is unrolled or not will be
>>>> +determined by target costing. The resulting vectorized loop may still be
>>>> +unrolled more in later passes depending on the target costing.
>>>> +
>>>> @end table
>>>>
>>>> @node Thread-Local
>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/unroll-vect.c
>>> b/gcc/testsuite/gcc.target/aarch64/unroll-vect.c
>>>> new file mode 100644
>>>> index
>>> 0000000000000000000000000000000000000000..3cb774ba95787ebee488fb
>>> e7306299ef28e6bb35
>>>> --- /dev/null
>>>> +++ b/gcc/testsuite/gcc.target/aarch64/unroll-vect.c
>>>> @@ -0,0 +1,20 @@
>>>> +/* { dg-do compile } */
>>>> +/* { dg-additional-options "-O3 -march=armv8-a --param aarch64-autovec-
>>> preference=asimd-only -std=gnu99" } */
>>>> +/* { dg-final { check-function-bodies "**" "" "" } } */
>>>> +
>>>> +/*
>>>> +** f1:
>>>> +** ...
>>>> +** add v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
>>>> +** add v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
>>>> +** add v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
>>>> +** add v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
>>>> +** ...
>>>> +*/
>>>> +void f1 (int *restrict a, int n)
>>>> +{
>>>> +#pragma GCC unroll 16
>>>> + for (int i = 0; i < n; i++)
>>>> + a[i] *= 2;
>>>> +}
>>>> +
>>>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
>>>> index
>>> fe6f3cf188e40396b299ff9e814cc402bc2d4e2d..f215b6bc7881e7e659272cefbe
>>> 3d5c8892ef768c 100644
>>>> --- a/gcc/tree-vect-loop.cc
>>>> +++ b/gcc/tree-vect-loop.cc
>>>> @@ -1073,6 +1073,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in,
>>> vec_info_shared *shared)
>>>> peeling_for_gaps (false),
>>>> peeling_for_niter (false),
>>>> early_breaks (false),
>>>> + user_unroll (false),
>>>> no_data_dependencies (false),
>>>> has_mask_store (false),
>>>> scalar_loop_scaling (profile_probability::uninitialized ()),
>>>> @@ -3428,27 +3429,51 @@ vect_analyze_loop_1 (class loop *loop,
>>> vec_info_shared *shared,
>>>> res ? "succeeded" : "failed",
>>>> GET_MODE_NAME (loop_vinfo->vector_mode));
>>>>
>>>> - if (res && !LOOP_VINFO_EPILOGUE_P (loop_vinfo) &&
>>> suggested_unroll_factor > 1)
>>>> + auto user_unroll = LOOP_VINFO_LOOP (loop_vinfo)->unroll;
>>>> + if (res && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>>>> + /* Check to see if the user wants to unroll or if the target wants
>>>> to. */
>>>> + && (suggested_unroll_factor > 1 || user_unroll > 1))
>>>> {
>>>> - if (dump_enabled_p ())
>>>> - dump_printf_loc (MSG_NOTE, vect_location,
>>>> + if (suggested_unroll_factor == 1)
>>>> + {
>>>> + int assumed_vf = vect_vf_for_cost (loop_vinfo);
>>>> + int unroll_fact = user_unroll / assumed_vf;
>>>> + suggested_unroll_factor = 1 << ceil_log2 (unroll_fact);
>>>
>>> So with SLP we can have non-power-of-two vectorization factors, and
>>> certainly the extra unrolling we apply ontop of determining the VF
>>> has no additional restrictions. So I woder why you go through
>>> 1 << ceil_log2 (unroll_fact) here instead of just using
>>> user_unroll / assumed_vf? IMO a truncating division is OK, we could
>>> argue for rounding? If we apply costing across modes there might
>>> be a mode with the unrolling exactly matching - I'm unsure if we
>>> want to adhere to #pragma GCC unroll irrespective of costing or not.
>>>
>>> That said, as we simply do
>>>
>>> if (applying_suggested_uf)
>>> LOOP_VINFO_VECT_FACTOR (loop_vinfo) *= loop_vinfo-
>>>> suggested_unroll_factor;
>>>
>>> I'd say go with
>>>
>>> suggested_unroll_factor = user_unroll / assumed_vf;
>>>
>>> ?
>>>
>>
>> The reason for it was that there's an assumption in place that the final VF
>> is a power of two. Which is enforced by
>>
>> if (!is_gimple_val (niters_vector))
>> {
>> var = create_tmp_var (type, "bnd");
>> gimple_seq stmts = NULL;
>> niters_vector = force_gimple_operand (niters_vector, &stmts, true, var);
>> gsi_insert_seq_on_edge_immediate (pe, stmts);
>> /* Peeling algorithm guarantees that vector loop bound is at least ONE,
>> we set range information to make niters analyzer's life easier.
>> Note the number of latch iteration value can be TYPE_MAX_VALUE so
>> we have to represent the vector niter TYPE_MAX_VALUE + 1 >> log_vf. */
>> if (stmts != NULL && log_vf)
>
> That’s just executed whe unrolling is enforced? The VF can easily be
> non-power of two, even odd, when SLP is in effect.
>
>> where log_vf is using exact_log2 which doesn't return null on non-power-of-2
>> values,
>> it returns -1;
>>
>> So the check is somewhat incorrect.
>
> Wrong I’d say.
>
>> We'd need
>>
>> if (stmts != NULL && log_vf && ! integer_minus_onep (log_vf))
>> {
>>
>> But I didn't do so because we would then not set range information for
>> niters_vector.
>
> But does the code rely on logVF for the range? Why would it only work for
> that?
> Peeling ensures niter/VF >= 1 even for VF == 3
Ok, I think I’m wrong here. The only odd VF we’d actually chose is 1. That
said, we shouldn’t limit ourselves here and fixup places with this assumption.
Richard
>> So instead I just made sure that the unfoll factor is a power of two, such
>> that the multiplication
>> returns a power of two.
>>
>> Would you like the extra !integer_minus_onep (log_vf)) instead? It does seem
>> that many things
>> would be less optimal. E.g. in
>>
>> /* Create: niters >> log2(vf) */
>> /* If it's known that niters == number of latch executions + 1 doesn't
>> overflow, we can generate niters >> log2(vf); otherwise we generate
>> (niters - vf) >> log2(vf) + 1 by using the fact that we know ratio
>> will be at least one. */
>> log_vf = build_int_cst (type, exact_log2 (const_vf));
>> if (niters_no_overflow)
>> niters_vector = fold_build2 (RSHIFT_EXPR, type, ni_minus_gap, log_vf);
>>
>> even for a small non-power of two unroll factor like, 3 would then become
>> -1, so niters_vector suddenly
>> becomes unknown.
>
> No, but we need to use a division instead of a shift.
>
> Richard
>
>>
>> Thanks,
>> Tamar
>>
>>> The rest of the patch looks OK to me.
>>>
>>> Richard.
>>>
>>>
>>>> + if (suggested_unroll_factor > 1)
>>>> + {
>>>> + if (dump_enabled_p ())
>>>> + dump_printf_loc (MSG_NOTE, vect_location,
>>>> + "setting unroll factor to %d based on user requested "
>>>> + "unroll factor %d and suggested vectorization "
>>>> + "factor: %d\n",
>>>> + suggested_unroll_factor, user_unroll, assumed_vf);
>>>> + }
>>>> + }
>>>> +
>>>> + if (suggested_unroll_factor > 1)
>>>> + {
>>>> + if (dump_enabled_p ())
>>>> + dump_printf_loc (MSG_NOTE, vect_location,
>>>> "***** Re-trying analysis for unrolling"
>>>> " with unroll factor %d and slp %s.\n",
>>>> suggested_unroll_factor,
>>>> slp_done_for_suggested_uf ? "on" : "off");
>>>> - loop_vec_info unroll_vinfo
>>>> - = vect_create_loop_vinfo (loop, shared, loop_form_info, NULL);
>>>> - unroll_vinfo->vector_mode = vector_mode;
>>>> - unroll_vinfo->suggested_unroll_factor = suggested_unroll_factor;
>>>> - opt_result new_res = vect_analyze_loop_2 (unroll_vinfo, fatal, NULL,
>>>> - slp_done_for_suggested_uf);
>>>> - if (new_res)
>>>> - {
>>>> - delete loop_vinfo;
>>>> - loop_vinfo = unroll_vinfo;
>>>> - }
>>>> - else
>>>> - delete unroll_vinfo;
>>>> + loop_vec_info unroll_vinfo
>>>> + = vect_create_loop_vinfo (loop, shared, loop_form_info, NULL);
>>>> + unroll_vinfo->vector_mode = vector_mode;
>>>> + unroll_vinfo->suggested_unroll_factor = suggested_unroll_factor;
>>>> + opt_result new_res
>>>> + = vect_analyze_loop_2 (unroll_vinfo, fatal, NULL,
>>>> + slp_done_for_suggested_uf);
>>>> + if (new_res)
>>>> + {
>>>> + delete loop_vinfo;
>>>> + loop_vinfo = unroll_vinfo;
>>>> + LOOP_VINFO_USER_UNROLL (loop_vinfo) = user_unroll > 1;
>>>> + }
>>>> + else
>>>> + delete unroll_vinfo;
>>>> + }
>>>> }
>>>>
>>>> /* Remember the autodetected vector mode. */
>>>> @@ -12373,6 +12398,13 @@ vect_transform_loop (loop_vec_info loop_vinfo,
>>> gimple *loop_vectorized_call)
>>>> dump_printf_loc (MSG_NOTE, vect_location, "Disabling unrolling due to"
>>>> " variable-length vectorization factor\n");
>>>> }
>>>> +
>>>> + /* When we have unrolled the loop due to a user requested value we
>>>> should
>>>> + leave it up to the RTL unroll heuristics to determine if it's still
>>>> worth
>>>> + while to unroll more. */
>>>> + if (LOOP_VINFO_USER_UNROLL (loop_vinfo))
>>>> + loop->unroll = 0;
>>>> +
>>>> /* Free SLP instances here because otherwise stmt reference counting
>>>> won't work. */
>>>> slp_instance instance;
>>>> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>>>> index
>>> a2f33a5ecd60288fe7f28ee639ff8b6a77667796..8fd8c10ec64f7241d6b097491f
>>> 84400164893911 100644
>>>> --- a/gcc/tree-vectorizer.h
>>>> +++ b/gcc/tree-vectorizer.h
>>>> @@ -970,6 +970,10 @@ public:
>>>> /* Main loop IV cond. */
>>>> gcond* loop_iv_cond;
>>>>
>>>> + /* True if we have an unroll factor requested by the user through
>>>> pragma GCC
>>>> + unroll. */
>>>> + bool user_unroll;
>>>> +
>>>> /* True if there are no loop carried data dependencies in the loop.
>>>> If loop->safelen <= 1, then this is always true, either the loop
>>>> didn't have any loop carried data dependencies, or the loop is being
>>>> @@ -1094,6 +1098,7 @@ public:
>>>> #define LOOP_VINFO_CHECK_UNEQUAL_ADDRS(L) (L)->check_unequal_addrs
>>>> #define LOOP_VINFO_CHECK_NONZERO(L) (L)->check_nonzero
>>>> #define LOOP_VINFO_LOWER_BOUNDS(L) (L)->lower_bounds
>>>> +#define LOOP_VINFO_USER_UNROLL(L) (L)->user_unroll
>>>> #define LOOP_VINFO_GROUPED_STORES(L) (L)->grouped_stores
>>>> #define LOOP_VINFO_SLP_INSTANCES(L) (L)->slp_instances
>>>> #define LOOP_VINFO_SLP_UNROLLING_FACTOR(L) (L)->slp_unrolling_factor
>>>>
>>>
>>> --
>>> Richard Biener <rguent...@suse.de>
>>> SUSE Software Solutions Germany GmbH,
>>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)