unroll directly in vectorizer

Tamar Christina Tue, 13 May 2025 06:46:55 -0700

> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Tuesday, May 13, 2025 1:59 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> Subject: Re: [PATCH 1/2]middle-end: Apply loop->unroll directly in vectorizer
> 
> On Tue, 13 May 2025, Tamar Christina wrote:
> 
> > Hi All,
> >
> > Consider the loop
> >
> > void f1 (int *restrict a, int n)
> > {
> > #pragma GCC unroll 4 requested
> >   for (int i = 0; i < n; i++)
> >     a[i] *= 2;
> > }
> >
> > Which today is vectorized and then unrolled 3x by the RTL unroller due to 
> > the
> > use of the pragma.  This is unfortunate because the pragma was intended for 
> > the
> > scalar loop but we end up with an unrolled vector loop and a longer path to 
> > the
> > entry which has a low enough VF requirement to enter.
> >
> > This patch instead seeds the suggested_unroll_factor with the value the user
> > requested and instead uses it to maintain the total VF that the user wanted 
> > the
> > scalar loop to maintain.
> >
> > In effect it applies the unrolling inside the vector loop itself.  This has 
> > the
> > benefits for things like reductions, as it allows us to split the 
> > accumulator
> > and so the unrolled loop is more efficient.  For early-break it allows the
> > cbranch call to be shared between the unrolled elements, giving you more
> > effective unrolling because it doesn't need the repeated cbranch which can 
> > be
> > expensive.
> >
> > The target can then choose to create multiple epilogues to deal with the 
> > "rest".
> >
> > The example above now generates:
> >
> > .L4:
> >         ldr     q31, [x2]
> >         add     v31.4s, v31.4s, v31.4s
> >         str     q31, [x2], 16
> >         cmp     x2, x3
> >         bne     .L4
> >
> > as V4SI maintains the requested VF, but e.g. pragma unroll 8 generates:
> >
> > .L4:
> >         ldp     q30, q31, [x2]
> >         add     v30.4s, v30.4s, v30.4s
> >         add     v31.4s, v31.4s, v31.4s
> >         stp     q30, q31, [x2], 32
> >         cmp     x3, x2
> >         bne     .L4
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > -m32, -m64 and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >     * tree-vectorizer.h (vector_costs::set_suggested_unroll_factor,
> >     LOOP_VINFO_USER_UNROLL): New.
> >     (class _loop_vec_info): Add user_unroll.
> >     * tree-vect-loop.cc (vect_estimate_min_profitable_iters): Set
> >     suggested_unroll_factor before calling backend costing.
> >     (_loop_vec_info::_loop_vec_info): Initialize user_unroll.
> >     (vect_transform_loop): Clear the loop->unroll value if the pragma was
> >     used.
> >
> > ---
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index
> fe6f3cf188e40396b299ff9e814cc402bc2d4e2d..a13e4978bc7ed651be3a65d24
> 3e84c5aaf706f65 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -1073,6 +1073,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in,
> vec_info_shared *shared)
> >      peeling_for_gaps (false),
> >      peeling_for_niter (false),
> >      early_breaks (false),
> > +    user_unroll (false),
> >      no_data_dependencies (false),
> >      has_mask_store (false),
> >      scalar_loop_scaling (profile_probability::uninitialized ()),
> > @@ -4983,6 +4984,26 @@ vect_estimate_min_profitable_iters (loop_vec_info
> loop_vinfo,
> >     }
> >      }
> >
> > +  /* Seed the target cost model with what the user requested if the unroll
> > +     factor is larger than 1 vector VF.  */
> > +  auto user_unroll = LOOP_VINFO_LOOP (loop_vinfo)->unroll;
> > +  if (user_unroll > 1)
> > +    {
> > +      LOOP_VINFO_USER_UNROLL (loop_vinfo) = true;
> > +      int unroll_fact = user_unroll / assumed_vf;
> > +      unroll_fact = 1 << ceil_log2 (unroll_fact);
> > +      if (unroll_fact > 1)
> > +   {
> > +     if (dump_enabled_p ())
> > +       dump_printf_loc (MSG_NOTE, vect_location,
> > +                    "setting unroll factor to %d based on user requested "
> > +                    "unroll factor %d and suggested vectorization "
> > +                    "factor: %d\n",
> > +                    unroll_fact, user_unroll, assumed_vf);
> > +     loop_vinfo->vector_costs->set_suggested_unroll_factor (unroll_fact);
> 
> So usually targets apply this in finish_cost () so the vectorizer
> tries again with the suggested unroll factor.  So that's what we
> then do unless the target overrides the factor again?


Yes, I intended to let the target be able to override the unrolling, because 
with
In particular -mcpu we know about issue rates and throughput limitations.

We can tell when unrolling would result in slower code, for instance if it's
putting too much bottleneck on cbranch for early break for instance.

> 
> But then ...
> 
> > +   }
> > +    }
> > +
> >    /* Complete the target-specific cost calculations.  */
> >    loop_vinfo->vector_costs->finish_cost (loop_vinfo->scalar_costs);
> >    vec_prologue_cost = loop_vinfo->vector_costs->prologue_cost ();
> > @@ -12364,14 +12385,20 @@ vect_transform_loop (loop_vec_info loop_vinfo,
> gimple *loop_vectorized_call)
> >                      GET_MODE_NAME (loop_vinfo->vector_mode));
> >      }
> >
> > -  /* Loops vectorized with a variable factor won't benefit from
> > +  /* Loops vectorized would have already taken into account unrolling 
> > specified
> > +     by the user as the suggested unroll factor, as such we need to 
> > prevent the
> > +     RTL unroller from unrolling twice.  The only exception is static known
> > +     iterations where we would have expected the loop to be fully unrolled.
> > +     Loops vectorized with a variable factor won't benefit from
> >       unrolling/peeling.  */
> > -  if (!vf.is_constant ())
> > +  if (LOOP_VINFO_USER_UNROLL (loop_vinfo)
> 
> ... this is the transform phase - is LOOP_VINFO_USER_UNROLL copied
> from the earlier attempt?

Ah, I see I forgot to copy it when the loop_vinfo is copied..  Will fix.

> 
> If the #pragma unroll number doesn't exactly match the VF and followup
> unrolling cannot make it matched, what should we do?  You unconditionally
> disable any followup unrolling in this case.
> 

Yes. This can also happen because the cost model could have determined that
For this micro-architecture, further unrolling is detrimental. i.e. there's no 
point
In overloading your load units needlessly for instance.

The patch takes the stance that it leaves it up to the backend, and the backend 
is
In control of the vector code.  The pragma was intended to apply to scalar 
anyway.

> If using the cost set suggested-unroll-factor this could have all
> be done from within the target.  I had expected you'd seed
> suggested_unroll_factor by loop->unroll for the first analysis
> attempt somehow, altering the vectorization factor chosen?  But
> it might require quite some refactoring to make that possibility
> not too ugly.
> 
> It also seems to me that we won't prefer wider vectors with
> #pragma unroll, since IIRC with using suggested_unroll_factor
> we keep the chosen vector_mode the same?  In particular with
> 
> #pramga GCC unroll 4
> 
> and SImode data you'd still get VF == 8 with -mavx2, just not
> extra unrolling?  It might be reasonable to expect SSE vectors
> to be chosen there to match the requested unroll factor, even
> when compiling with -mavx2?

Yeah it only focused on using it as an unrolling factor.  As I thought that's 
what
you meant with seeding the unroll factor.   I guess instead if we want
to consider wider vector modes then it should instead increase the VF itself.

But I'm wondering what happens if we do so it and it can't find a mode to 
satisfy
It. Increasing the VF runs the risk that the vectorizer thinks it can't 
vectorize
The loop doesn't it?

I did when I was trying this first try to increase assumed_vf, and have
It calculate it based on that.  But it would ICE later on if the unroll factor 
was
Too high. E.g. something silly like pragma GCC unroll 80.

But did you want to go down the road of increasing VF instead?

Thanks,
Tamar

> 
> Richard.
> 
> > +      || !vf.is_constant ())
> >      {
> >        loop->unroll = 1;
> >        if (dump_enabled_p ())
> >     dump_printf_loc (MSG_NOTE, vect_location, "Disabling unrolling due to"
> > -                    " variable-length vectorization factor\n");
> > +                    " having already considered vector unrolling or "
> > +                    "variable-length vectorization factor.\n");
> >      }
> >    /* Free SLP instances here because otherwise stmt reference counting
> >       won't work.  */
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index
> a2f33a5ecd60288fe7f28ee639ff8b6a77667796..5675309ab16dc7f7f0b98e229
> e4798075e6cdd7e 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -970,6 +970,10 @@ public:
> >    /* Main loop IV cond.  */
> >    gcond* loop_iv_cond;
> >
> > +  /* True if we have an unroll factor requested by the user through pragma 
> > GCC
> > +     unroll.  */
> > +  bool user_unroll;
> > +
> >    /* True if there are no loop carried data dependencies in the loop.
> >       If loop->safelen <= 1, then this is always true, either the loop
> >       didn't have any loop carried data dependencies, or the loop is being
> > @@ -1094,6 +1098,7 @@ public:
> >  #define LOOP_VINFO_CHECK_UNEQUAL_ADDRS(L)  (L)->check_unequal_addrs
> >  #define LOOP_VINFO_CHECK_NONZERO(L)        (L)->check_nonzero
> >  #define LOOP_VINFO_LOWER_BOUNDS(L)         (L)->lower_bounds
> > +#define LOOP_VINFO_USER_UNROLL(L)          (L)->user_unroll
> >  #define LOOP_VINFO_GROUPED_STORES(L)       (L)->grouped_stores
> >  #define LOOP_VINFO_SLP_INSTANCES(L)        (L)->slp_instances
> >  #define LOOP_VINFO_SLP_UNROLLING_FACTOR(L) (L)->slp_unrolling_factor
> > @@ -1693,6 +1698,7 @@ public:
> >    unsigned int outside_cost () const;
> >    unsigned int total_cost () const;
> >    unsigned int suggested_unroll_factor () const;
> > +  void set_suggested_unroll_factor (unsigned int);
> >    machine_mode suggested_epilogue_mode () const;
> >
> >  protected:
> > @@ -1791,6 +1797,15 @@ vector_costs::suggested_unroll_factor () const
> >    return m_suggested_unroll_factor;
> >  }
> >
> > +/* Set the suggested unroll factor.  */
> > +
> > +inline void
> > +vector_costs::set_suggested_unroll_factor (unsigned unroll_fact)
> > +{
> > +  gcc_checking_assert (!m_finished);
> > +  m_suggested_unroll_factor = unroll_fact;
> > +}
> > +
> >  /* Return the suggested epilogue mode.  */
> >
> >  inline machine_mode
> >
> >
> >
> 
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

RE: [PATCH 1/2]middle-end: Apply loop->unroll directly in vectorizer

Reply via email to