On Wed, 3 Sep 2025, Richard Sandiford wrote:

> Tamar Christina <[email protected]> writes:
> >> -----Original Message-----
> >> From: Richard Biener <[email protected]>
> >> Sent: Tuesday, September 2, 2025 1:44 PM
> >> To: Tamar Christina <[email protected]>
> >> Cc: [email protected]; nd <[email protected]>
> >> Subject: Re: [PATCH 1/3]middle-end: clear the user unroll flag if the
> >> cost model has
> >> overriden it
> >> 
> >> On Tue, 2 Sep 2025, Tamar Christina wrote:
> >> 
> >> > > What was it that made you propose this change?
> >> >
> >> > When we have a loop of say int and a pragma unroll 4
> >> >
> >> > If the vectorizer picks V4SI as the mode, the requested unroll ended up
> >> > exactly matching the VF. As such the requested unroll is 1 and we don't
> >> > clear the pragma.
> >> >
> >> > So it did honor the requested unroll factor. However since we didn't set
> >> > the unroll amount back and left it at 4 the rtl unroller won't use the
> >> > rtl cost model at all and just unroll the vector loop 4 times.
> >> 
> >> Ah, OK.
> >> 
> >> > This change isn't to bypass the rtl cost model, it's to allow it to be
> >> > used rather than overriding it after vectorization.
> >> 
> >> OK, fine.  But still, consider
> >> 
> >> #pragma unroll 4
> >>  for (int i = 0; i < 64; ++i)
> >>   {
> >>     a[4*i+0] = i;
> >>     a[4*i+1] = i;
> >>     a[4*i+2] = i;
> >>     a[4*i+3] = i;
> >>   }
> >> 
> >> so VF == 1, suggested_unroll_factor == 4.  If we don't up VF to 4
> >> should we still claim we did any unrolling?  If the target suggested
> >> an unroll factor of two, should we instead change ->unroll to 2?
> >> Should the user unroll factor override the vector target one?
> >> 
> >
> > I think the target unroll factor should always win out, primarily because
> > of throughput based costing.  The loop above on a 4 VX system should
> > by the vectorizer already be using VF = 4, suggested_unroll_factor == 4.
> >
> > We also don't ever force unrolling for predicated SVE because for
> > predicated SVE we have to balance predicate throughput limitations
> > of any given CPU.  Having the user unroll factor be able to override
> > the cost model one will almost certainly lead to worse performance
> > in this case.
> 
> FWIW, cause and effect are kind-of the other way around: we request an
> unroll factor for SVE in the normal way, but doing so disables predication,
> thanks to:
> 
>       /* For partial-vector-usage=1, try to push the handling of partial
>          vectors to the epilogue, with the main loop continuing to operate
>          on full vectors.
> 
>          If we are unrolling we also do not want to use partial vectors. This
>          is to avoid the overhead of generating multiple masks and also to
>          avoid having to execute entire iterations of FALSE masked instructions
>          when dealing with one or less full iterations.
> 
>          ??? We could then end up failing to use partial vectors if we
>          decide to peel iterations into a prologue, and if the main loop
>          then ends up processing fewer than VF iterations.  */
>       if ((param_vect_partial_vector_usage == 1
>            || loop_vinfo->suggested_unroll_factor > 1)
>           && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>           && !vect_known_niters_smaller_than_vf (loop_vinfo))
>         LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>       else
>         LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;

The context doesn't show it, but I guess this honors
LOOP_VINFO_MUST_USE_PARTIAL_VECTORS_P.

Also, I wonder: when we don't use partial vectors, does that mean
we are using fixed-length vectors?  Unless -msve-vector-bits is
specified, does this mean using NEON width?  At least I don't remember
seeing non-len-based code that queries the actual vector length
at runtime.

> In other words, the choice of unroll factor is an input to the
> predication decision, rather than the predication decision being an
> input to the choice of unroll factor.

In principle this makes sense.
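
To make the direction of that dependency concrete, here is a minimal
standalone model of the quoted condition.  It is only a sketch: the
struct, the enum and choose_partial_vectors are hypothetical names
made up for illustration, not GCC's, and it assumes the enclosing
code (not shown in the quote) has already decided that partial
vectors are possible at all.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state the quoted condition reads.  */
struct loop_model
{
  int partial_vector_usage;    /* models param_vect_partial_vector_usage */
  int suggested_unroll_factor; /* models loop_vinfo->suggested_unroll_factor */
  bool is_epilogue;            /* models LOOP_VINFO_EPILOGUE_P */
  bool niters_smaller_than_vf; /* models vect_known_niters_smaller_than_vf */
};

/* Hypothetical result type: which of the two flags would be set.  */
enum partial_choice
{
  MAIN_LOOP_PARTIAL, /* LOOP_VINFO_USING_PARTIAL_VECTORS_P = true */
  EPILOGUE_PARTIAL   /* LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P = true */
};

/* Mirror the quoted if/else.  The unroll factor is an input here:
   a suggested unroll factor > 1 pushes partial vectors out of the
   main loop and into the epilogue.  */
static enum partial_choice
choose_partial_vectors (const struct loop_model *m)
{
  if ((m->partial_vector_usage == 1 || m->suggested_unroll_factor > 1)
      && !m->is_epilogue
      && !m->niters_smaller_than_vf)
    return EPILOGUE_PARTIAL;
  return MAIN_LOOP_PARTIAL;
}
```

With partial_vector_usage == 2, changing suggested_unroll_factor
from 1 to 4 flips the result from MAIN_LOOP_PARTIAL to
EPILOGUE_PARTIAL, which is the "unrolling disables predication in
the main loop" effect described above.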

Richard.
