unroll directly in vectorizer

Richard Biener Tue, 13 May 2025 09:37:22 -0700

On Tue, 13 May 2025, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Tuesday, May 13, 2025 1:59 PM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > Subject: Re: [PATCH 1/2]middle-end: Apply loop->unroll directly in 
> > vectorizer
> > 
> > On Tue, 13 May 2025, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > Consider the loop
> > >
> > > void f1 (int *restrict a, int n)
> > > {
> > > #pragma GCC unroll 4 requested
> > >   for (int i = 0; i < n; i++)
> > >     a[i] *= 2;
> > > }
> > >
> > > Which today is vectorized and then unrolled 3x by the RTL unroller due to 
> > > the
> > > use of the pragma.  This is unfortunate because the pragma was intended 
> > > for the
> > > scalar loop but we end up with an unrolled vector loop and a longer path 
> > > to the
> > > entry which has a low enough VF requirement to enter.
> > >
> > > This patch instead seeds the suggested_unroll_factor with the value the 
> > > user
> > > requested and instead uses it to maintain the total VF that the user 
> > > wanted the
> > > scalar loop to maintain.
> > >
> > > In effect it applies the unrolling inside the vector loop itself.  This 
> > > has the
> > > benefits for things like reductions, as it allows us to split the 
> > > accumulator
> > > and so the unrolled loop is more efficient.  For early-break it allows the
> > > cbranch call to be shared between the unrolled elements, giving you more
> > > effective unrolling because it doesn't need the repeated cbranch which 
> > > can be
> > > expensive.
> > >
> > > The target can then choose to create multiple epilogues to deal with the 
> > > "rest".
> > >
> > > The example above now generates:
> > >
> > > .L4:
> > >         ldr     q31, [x2]
> > >         add     v31.4s, v31.4s, v31.4s
> > >         str     q31, [x2], 16
> > >         cmp     x2, x3
> > >         bne     .L4
> > >
> > > as V4SI maintains the requested VF, but e.g. pragma unroll 8 generates:
> > >
> > > .L4:
> > >         ldp     q30, q31, [x2]
> > >         add     v30.4s, v30.4s, v30.4s
> > >         add     v31.4s, v31.4s, v31.4s
> > >         stp     q30, q31, [x2], 32
> > >         cmp     x3, x2
> > >         bne     .L4
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > > -m32, -m64 and no issues.
> > >
> > > Ok for master?
> > >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > >   * tree-vectorizer.h (vector_costs::set_suggested_unroll_factor,
> > >   LOOP_VINFO_USER_UNROLL): New.
> > >   (class _loop_vec_info): Add user_unroll.
> > >   * tree-vect-loop.cc (vect_estimate_min_profitable_iters): Set
> > >   suggested_unroll_factor before calling backend costing.
> > >   (_loop_vec_info::_loop_vec_info): Initialize user_unroll.
> > >   (vect_transform_loop): Clear the loop->unroll value if the pragma was
> > >   used.
> > >
> > > ---
> > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > index
> > fe6f3cf188e40396b299ff9e814cc402bc2d4e2d..a13e4978bc7ed651be3a65d24
> > 3e84c5aaf706f65 100644
> > > --- a/gcc/tree-vect-loop.cc
> > > +++ b/gcc/tree-vect-loop.cc
> > > @@ -1073,6 +1073,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in,
> > vec_info_shared *shared)
> > >      peeling_for_gaps (false),
> > >      peeling_for_niter (false),
> > >      early_breaks (false),
> > > +    user_unroll (false),
> > >      no_data_dependencies (false),
> > >      has_mask_store (false),
> > >      scalar_loop_scaling (profile_probability::uninitialized ()),
> > > @@ -4983,6 +4984,26 @@ vect_estimate_min_profitable_iters (loop_vec_info
> > loop_vinfo,
> > >   }
> > >      }
> > >
> > > +  /* Seed the target cost model with what the user requested if the 
> > > unroll
> > > +     factor is larger than 1 vector VF.  */
> > > +  auto user_unroll = LOOP_VINFO_LOOP (loop_vinfo)->unroll;
> > > +  if (user_unroll > 1)
> > > +    {
> > > +      LOOP_VINFO_USER_UNROLL (loop_vinfo) = true;
> > > +      int unroll_fact = user_unroll / assumed_vf;
> > > +      unroll_fact = 1 << ceil_log2 (unroll_fact);
> > > +      if (unroll_fact > 1)
> > > + {
> > > +   if (dump_enabled_p ())
> > > +     dump_printf_loc (MSG_NOTE, vect_location,
> > > +                  "setting unroll factor to %d based on user requested "
> > > +                  "unroll factor %d and suggested vectorization "
> > > +                  "factor: %d\n",
> > > +                  unroll_fact, user_unroll, assumed_vf);
> > > +   loop_vinfo->vector_costs->set_suggested_unroll_factor (unroll_fact);
> > 
> > So usually targets apply this in finish_cost () so the vectorizer
> > tries again with the suggested unroll factor.  So that's what we
> > then do unless the target overrides the factor again?
> 
> Yes, I intended to let the target be able to override the unrolling, because 
> with
> In particular -mcpu we know about issue rates and throughput limitations.
> 
> We can tell when unrolling would result in slower code, for instance if it's
> putting too much bottleneck on cbranch for early break for instance.
> 
> > 
> > But then ...
> > 
> > > + }
> > > +    }
> > > +
> > >    /* Complete the target-specific cost calculations.  */
> > >    loop_vinfo->vector_costs->finish_cost (loop_vinfo->scalar_costs);
> > >    vec_prologue_cost = loop_vinfo->vector_costs->prologue_cost ();
> > > @@ -12364,14 +12385,20 @@ vect_transform_loop (loop_vec_info loop_vinfo,
> > gimple *loop_vectorized_call)
> > >                    GET_MODE_NAME (loop_vinfo->vector_mode));
> > >      }
> > >
> > > -  /* Loops vectorized with a variable factor won't benefit from
> > > +  /* Loops vectorized would have already taken into account unrolling 
> > > specified
> > > +     by the user as the suggested unroll factor, as such we need to 
> > > prevent the
> > > +     RTL unroller from unrolling twice.  The only exception is static 
> > > known
> > > +     iterations where we would have expected the loop to be fully 
> > > unrolled.
> > > +     Loops vectorized with a variable factor won't benefit from
> > >       unrolling/peeling.  */
> > > -  if (!vf.is_constant ())
> > > +  if (LOOP_VINFO_USER_UNROLL (loop_vinfo)
> > 
> > ... this is the transform phase - is LOOP_VINFO_USER_UNROLL copied
> > from the earlier attempt?
> 
> Ah, I see I forgot to copy it when the loop_vinfo is copied..  Will fix.
> 
> > 
> > If the #pragma unroll number doesn't exactly match the VF and followup
> > unrolling cannot make it matched, what should we do?  You unconditionally
> > disable any followup unrolling in this case.
> > 
> 
> Yes. This can also happen because the cost model could have determined that
> For this micro-architecture, further unrolling is detrimental. i.e. there's 
> no point
> In overloading your load units needlessly for instance.
> 
> The patch takes the stance that it leaves it up to the backend, and the 
> backend is
> In control of the vector code.  The pragma was intended to apply to scalar 
> anyway.
> 
> > If using the cost set suggested-unroll-factor this could have all
> > be done from within the target.  I had expected you'd seed
> > suggested_unroll_factor by loop->unroll for the first analysis
> > attempt somehow, altering the vectorization factor chosen?  But
> > it might require quite some refactoring to make that possibility
> > not too ugly.
> > 
> > It also seems to me that we won't prefer wider vectors with
> > #pragma unroll, since IIRC with using suggested_unroll_factor
> > we keep the chosen vector_mode the same?  In particular with
> > 
> > #pramga GCC unroll 4
> > 
> > and SImode data you'd still get VF == 8 with -mavx2, just not
> > extra unrolling?  It might be reasonable to expect SSE vectors
> > to be chosen there to match the requested unroll factor, even
> > when compiling with -mavx2?
> 
> Yeah it only focused on using it as an unrolling factor.  As I thought that's 
> what
> you meant with seeding the unroll factor.   I guess instead if we want
> to consider wider vector modes then it should instead increase the VF itself.
> 
> But I'm wondering what happens if we do so it and it can't find a mode to 
> satisfy
> It. Increasing the VF runs the risk that the vectorizer thinks it can't 
> vectorize
> The loop doesn't it?


No, it's more complicated give we fix the vector mode used without
considering the VF at all, which is why I said it might need quite some
refactoring to allow the VF to determine the mode used.

> I did when I was trying this first try to increase assumed_vf, and have
> It calculate it based on that.  But it would ICE later on if the unroll 
> factor was
> Too high. E.g. something silly like pragma GCC unroll 80.
> 
> But did you want to go down the road of increasing VF instead?

I did originally think that this is what we should do - but I don't
think we're ready for this yet.  Note applying the suggested unroll
factor also simply increases the VF - but by a integer factor from
what is determined by earlier analysis.  And that feature is designed
around leaving the mode the same - it just applies extra unrolling.

In the end whatever we do it's going to be a matter of documenting
the interaction between vectorization and #pragma GCC unroll.

The way you handle it is reasonable, the question is whether to
set loop->unroll to 1 in the end (disable any further unrolling)
or to 0 (only auto-unroll based on heuristics).  I'd argue 0
makes more sense - iff we chose to apply the extra unrolling
during vectorization.

Richard.

> Thanks,
> Tamar
> 
> > 
> > Richard.
> > 
> > > +      || !vf.is_constant ())
> > >      {
> > >        loop->unroll = 1;
> > >        if (dump_enabled_p ())
> > >   dump_printf_loc (MSG_NOTE, vect_location, "Disabling unrolling due to"
> > > -                  " variable-length vectorization factor\n");
> > > +                  " having already considered vector unrolling or "
> > > +                  "variable-length vectorization factor.\n");
> > >      }
> > >    /* Free SLP instances here because otherwise stmt reference counting
> > >       won't work.  */
> > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > > index
> > a2f33a5ecd60288fe7f28ee639ff8b6a77667796..5675309ab16dc7f7f0b98e229
> > e4798075e6cdd7e 100644
> > > --- a/gcc/tree-vectorizer.h
> > > +++ b/gcc/tree-vectorizer.h
> > > @@ -970,6 +970,10 @@ public:
> > >    /* Main loop IV cond.  */
> > >    gcond* loop_iv_cond;
> > >
> > > +  /* True if we have an unroll factor requested by the user through 
> > > pragma GCC
> > > +     unroll.  */
> > > +  bool user_unroll;
> > > +
> > >    /* True if there are no loop carried data dependencies in the loop.
> > >       If loop->safelen <= 1, then this is always true, either the loop
> > >       didn't have any loop carried data dependencies, or the loop is being
> > > @@ -1094,6 +1098,7 @@ public:
> > >  #define LOOP_VINFO_CHECK_UNEQUAL_ADDRS(L)  (L)->check_unequal_addrs
> > >  #define LOOP_VINFO_CHECK_NONZERO(L)        (L)->check_nonzero
> > >  #define LOOP_VINFO_LOWER_BOUNDS(L)         (L)->lower_bounds
> > > +#define LOOP_VINFO_USER_UNROLL(L)          (L)->user_unroll
> > >  #define LOOP_VINFO_GROUPED_STORES(L)       (L)->grouped_stores
> > >  #define LOOP_VINFO_SLP_INSTANCES(L)        (L)->slp_instances
> > >  #define LOOP_VINFO_SLP_UNROLLING_FACTOR(L) (L)->slp_unrolling_factor
> > > @@ -1693,6 +1698,7 @@ public:
> > >    unsigned int outside_cost () const;
> > >    unsigned int total_cost () const;
> > >    unsigned int suggested_unroll_factor () const;
> > > +  void set_suggested_unroll_factor (unsigned int);
> > >    machine_mode suggested_epilogue_mode () const;
> > >
> > >  protected:
> > > @@ -1791,6 +1797,15 @@ vector_costs::suggested_unroll_factor () const
> > >    return m_suggested_unroll_factor;
> > >  }
> > >
> > > +/* Set the suggested unroll factor.  */
> > > +
> > > +inline void
> > > +vector_costs::set_suggested_unroll_factor (unsigned unroll_fact)
> > > +{
> > > +  gcc_checking_assert (!m_finished);
> > > +  m_suggested_unroll_factor = unroll_fact;
> > > +}
> > > +
> > >  /* Return the suggested epilogue mode.  */
> > >
> > >  inline machine_mode
> > >
> > >
> > >
> > 
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

RE: [PATCH 1/2]middle-end: Apply loop->unroll directly in vectorizer

Reply via email to