Re: [PATCH][RFC] Allow the target to request a masked vector epilogue

Richard Biener Fri, 16 May 2025 04:50:53 -0700

On Fri, 16 May 2025, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > Targets recently got the ability to request the vector mode to be
> > used for a vector epilogue (or the epilogue of a vector epilogue).  The
> > following adds the ability for it to indicate the epilogue should use
> > loop masking, irrespective of the --param vect-partial-vector-usage
> > setting.
> 
> Is overriding the --param a good idea?  I can imagine we'd eventually
> want a --param to override the override of the --param. :)


;)  So for x86 my plan is to use options_set to check if the user
explicitly specified the --param and in this case honor it, but
I don't want to have the default of the --param to affect all loops
in the same way.  So the default for x86 would be to let target
heuristics decide.

> > The simple prototype below uses a separate flag from the epilogue
> > mode, but I wonder how we want to more generally want to handle
> > whether to use masking or not when iterating over modes.  Currently
> > we mostly rely on --param vect-partial-vector-usage.  aarch64
> > and riscv have both variable-length modes but also fixed-size modes
> > where for the latter, like on x86, the target couldn't request
> > a mode specifically with or without masking.  It seems both
> > aarch64 and riscv fully rely on cost comparison and fully
> > exploiting the mode iteration space (but not masked vs. non-masked?!)
> > here?
> >
> > I was thinking of adding a vectorization_mode class that would
> > encapsulate the mode and whether to allow masking or alternatively
> > to make the vector_modes array (and the m_suggested_epilogue_mode)
> > a std::pair of mode and mask flag?
> 
> Predicated vs. non-predicated SVE is interesting for the main loop.
> The class sounds like it would be useful for that.
> 
> I suppose predicated vs. non-predicated SVE is also potentially
> interesting for an unrolled epilogue, although there, it would in
> theory be better to predicate only the last vector iteration
> (i.e. part predicated, part unpredicated).

Yes, the latter is what we want for AVX512, keep the main loop
not predicated but have the epilog predicated (using the same VF).

> So I suppose unpredicated SVE epilogue loops might be interesting
> until that partial predication is implemented, but I'm not sure how
> useful unpredicated SVE epilogue loops would be "once" the partial
> predication is supported.
> 
> I don't imagine we'll often know a priori for AArch64 which type
> of vector epilogue is best.  Since switching between SVE and
> Advanced SIMD is assumed to be essentially free, I think we'll
> still rely on the current approach of costing both and seeing
> which is cheaper.

So the other case we might run into on x86 is if you have a
known loop tripcount but fully vectorizing the epilogue is
still not possible because while we have half-SSE, like V8QImode,
we don't have V4QI or V2QI, so even with multiple epilogues
we'd still end up with an iterating scalar epilog.  Those
cases might be good candidates for a predicated epilog as well.
So in the end we'd prefer branchless epilogues.

Predication on x86 is quite a bit more expensive so I don't see
us using a predicated main vector loop anytime soon, and I'd
expect that to be the case for all archs when using a fixed-size
mode?  Is that the case for -msve-vector-bits=X as well?  Is
there an advantage for not using a predicated main vector loop?

Richard.

> Thanks,
> Richard
> 
> >
> > For the x86 case going the prototype way would be sufficient, we
> > wouldn't want to say use a masked AVX epilogue for a AVX512 loop,
> > so any further iteration on epilogue modes if the requested mode
> > would fail to vectorize is OK to be unmasked.
> >
> > Any comments on this?  You are not yet using m_suggested_epilogue_mode
> > to get more than one vector epilogue, this might be a way to add
> > heuristics when to use a masked epilogue.
> >
> > Thanks,
> > Richard.
> >
> >     * tree-vectorizer.h (vector_costs::suggested_epilogue_mode):
> >     Add masked output parameter and return m_masked_epilogue.
> >     (vector_costs::m_masked_epilogue): New tristate flag.
> >     (vector_costs::vector_costs): Initialize m_masked_epilogue.
> >     * tree-vect-loop.cc (vect_analyze_loop_1): Pass in masked
> >     flag to optionally initialize can_use_partial_vectors_p.
> >     (vect_analyze_loop): For epilogues also get whether to use
> >     a masked epilogue for this loop from the target and use
> >     that for the first epilogue mode we try.
> > ---
> >  gcc/tree-vect-loop.cc | 29 +++++++++++++++++++++--------
> >  gcc/tree-vectorizer.h | 12 +++++++++---
> >  2 files changed, 30 insertions(+), 11 deletions(-)
> >
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 2d1a6883e6b..4af510ff20c 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -3407,6 +3407,7 @@ vect_analyze_loop_1 (class loop *loop, 
> > vec_info_shared *shared,
> >                  const vect_loop_form_info *loop_form_info,
> >                  loop_vec_info orig_loop_vinfo,
> >                  const vector_modes &vector_modes, unsigned &mode_i,
> > +                int masked_p,
> >                  machine_mode &autodetected_vector_mode,
> >                  bool &fatal)
> >  {
> > @@ -3415,6 +3416,8 @@ vect_analyze_loop_1 (class loop *loop, 
> > vec_info_shared *shared,
> >  
> >    machine_mode vector_mode = vector_modes[mode_i];
> >    loop_vinfo->vector_mode = vector_mode;
> > +  if (masked_p != -1)
> > +    loop_vinfo->can_use_partial_vectors_p = masked_p;
> >    unsigned int suggested_unroll_factor = 1;
> >    unsigned slp_done_for_suggested_uf = 0;
> >  
> > @@ -3580,7 +3583,7 @@ vect_analyze_loop (class loop *loop, gimple 
> > *loop_vectorized_call,
> >        cached_vf_per_mode[last_mode_i] = -1;
> >        opt_loop_vec_info loop_vinfo
> >     = vect_analyze_loop_1 (loop, shared, &loop_form_info,
> > -                          NULL, vector_modes, mode_i,
> > +                          NULL, vector_modes, mode_i, -1,
> >                            autodetected_vector_mode, fatal);
> >        if (fatal)
> >     break;
> > @@ -3665,19 +3668,24 @@ vect_analyze_loop (class loop *loop, gimple 
> > *loop_vectorized_call,
> >       array may contain length-agnostic and length-specific modes.  Their
> >       ordering is not guaranteed, so we could end up picking a mode for the 
> > main
> >       loop that is after the epilogue's optimal mode.  */
> > +  int masked_p = -1;
> >    if (!unlimited_cost_model (loop)
> > -      && first_loop_vinfo->vector_costs->suggested_epilogue_mode () != 
> > VOIDmode)
> > +      && (first_loop_vinfo->vector_costs->suggested_epilogue_mode 
> > (masked_p)
> > +     != VOIDmode))
> >      {
> >        vector_modes[0]
> > -   = first_loop_vinfo->vector_costs->suggested_epilogue_mode ();
> > +   = first_loop_vinfo->vector_costs->suggested_epilogue_mode (masked_p);
> >        cached_vf_per_mode[0] = 0;
> >      }
> >    else
> >      vector_modes[0] = autodetected_vector_mode;
> >    mode_i = 0;
> >  
> > -  bool supports_partial_vectors =
> > -    partial_vectors_supported_p () && param_vect_partial_vector_usage != 0;
> > +  /* ???  If we'd compute LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P 
> > unconditionally
> > +     for the main loop we could re-use that as short-cut here.  Not if
> > +     the suggested epilogue mode is different, of course.  */
> > +  bool supports_partial_vectors = partial_vectors_supported_p () && 
> > ((param_vect_partial_vector_usage != 0 && masked_p != 0)
> > +                              || masked_p == 1);
> >    poly_uint64 first_vinfo_vf = LOOP_VINFO_VECT_FACTOR (first_loop_vinfo);
> >  
> >    loop_vec_info orig_loop_vinfo = first_loop_vinfo;
> > @@ -3707,7 +3715,7 @@ vect_analyze_loop (class loop *loop, gimple 
> > *loop_vectorized_call,
> >       opt_loop_vec_info loop_vinfo
> >         = vect_analyze_loop_1 (loop, shared, &loop_form_info,
> >                                orig_loop_vinfo,
> > -                              vector_modes, mode_i,
> > +                              vector_modes, mode_i, masked_p,
> >                                autodetected_vector_mode, fatal);
> >       if (fatal)
> >         break;
> > @@ -3738,6 +3746,10 @@ vect_analyze_loop (class loop *loop, gimple 
> > *loop_vectorized_call,
> >             break;
> >         }
> >  
> > +     /* ???  x86 does not have masked / unmasked modes, not sure
> > +        if SVE fixed-size modes would support masking?  That said,
> > +        the modes array should be changed to pair(mode, masked_p).  */
> > +     masked_p = -1;
> >       if (mode_i == vector_modes.length ())
> >         break;
> >     }
> > @@ -3748,12 +3760,13 @@ vect_analyze_loop (class loop *loop, gimple 
> > *loop_vectorized_call,
> >  
> >        /* When we selected a first vectorized epilogue, see if the target
> >      suggests to have another one.  */
> > +      masked_p = -1;
> >        if (!unlimited_cost_model (loop)
> > -     && (orig_loop_vinfo->vector_costs->suggested_epilogue_mode ()
> > +     && (orig_loop_vinfo->vector_costs->suggested_epilogue_mode (masked_p)
> >           != VOIDmode))
> >     {
> >       vector_modes[0]
> > -       = orig_loop_vinfo->vector_costs->suggested_epilogue_mode ();
> > +       = orig_loop_vinfo->vector_costs->suggested_epilogue_mode (masked_p);
> >       cached_vf_per_mode[0] = 0;
> >       mode_i = 0;
> >     }
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index 7aa2b02b63c..6022bc0d55a 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -1693,7 +1693,7 @@ public:
> >    unsigned int outside_cost () const;
> >    unsigned int total_cost () const;
> >    unsigned int suggested_unroll_factor () const;
> > -  machine_mode suggested_epilogue_mode () const;
> > +  machine_mode suggested_epilogue_mode (int &masked) const;
> >  
> >  protected:
> >    unsigned int record_stmt_cost (stmt_vec_info, vect_cost_model_location,
> > @@ -1717,8 +1717,12 @@ protected:
> >    unsigned int m_suggested_unroll_factor;
> >  
> >    /* The suggested mode to be used for a vectorized epilogue or VOIDmode,
> > -     determined at finish_cost.  */
> > +     determined at finish_cost.  m_masked_epilogue is whether in case
> > +     of non-VOIDmode epilogue mode the epilogue should use masked
> > +     vectorization, regardless of --param vect-partial-vector-usage
> > +     setting.  If -1 then the --param setting takes precedence.  */
> >    machine_mode m_suggested_epilogue_mode;
> > +  int m_masked_epilogue;
> >  
> >    /* True if finish_cost has been called.  */
> >    bool m_finished;
> > @@ -1734,6 +1738,7 @@ vector_costs::vector_costs (vec_info *vinfo, bool 
> > costing_for_scalar)
> >      m_costs (),
> >      m_suggested_unroll_factor(1),
> >      m_suggested_epilogue_mode(VOIDmode),
> > +    m_masked_epilogue (-1),
> >      m_finished (false)
> >  {
> >  }
> > @@ -1794,9 +1799,10 @@ vector_costs::suggested_unroll_factor () const
> >  /* Return the suggested epilogue mode.  */
> >  
> >  inline machine_mode
> > -vector_costs::suggested_epilogue_mode () const
> > +vector_costs::suggested_epilogue_mode (int &masked_p) const
> >  {
> >    gcc_checking_assert (m_finished);
> > +  masked_p = m_masked_epilogue;
> >    return m_suggested_epilogue_mode;
> >  }
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Re: [PATCH][RFC] Allow the target to request a masked vector epilogue

Reply via email to