On Thu, 11 Sep 2025, Tamar Christina wrote:

> This email is to discuss the optabs and IFNs that I will need to finish off
> the early break optimizations in GCC 16 and to minimize respins on the final
> patches.
> 
> Optabs only.  These will be handled by expand rewriting them if the target
> supports them:
> 
> -- cbranch
> 
> At the moment we model vector compare and branch using cbranch on vector
> arguments. e.g.
> 
> a = b `cmp` c
> if (a <cmp2> 0)
>   ...
> 
> where <cmp2> is != or == reflecting whether we want ANY or ALL.
> 
> The plan is to deprecate this in favor of an explicit vector compare, which
> allows us to elide the != 0 and == 0.  This lets us generate more efficient
> code in the backends.  For instance, for Adv. SIMD we can now use an SVE
> compare + branch.
> 
> For this feature 6 new optabs are needed, and cbranch will be made unsupported
> for vector arguments and I will update all backends.
> 
> -- vec_cbranch_any, vec_cbranch_all
> 
> These are the unpredicated versions of the compares.  This means these
> versions also do not have an ELSE argument as all lanes are relevant.
> 
> cond_vec_cbranch_any, cond_vec_cbranch_all
> cond_len_vec_cbranch_any, cond_len_vec_cbranch_all
> 
> These are the predicated equivalents of the above.  These take the
> predicate/len of the operation and also take an ELSE argument for what
> happens when the predicate is inactive.  I think the default is zero.  But
> I'm unsure how to handle the case where a target wants something other than
> zero, since this is an expand-time rewrite and so I can't reject the
> expansion.

You are rewriting from a GIMPLE_COND, right?  If so the vectorizer
should have verified an appropriate expansion exists (I suggest to
separate out a function that selects the optab, which you can call from
both expansion and vectorization for this).

> -- vec_cmp
> 
> At the moment vector compares are always unpredicated. Masking or LENs are
> applied to the result of the unpredicated compare.
> 
> E.g.
> 
> a = b `cmp` c
> d = a & loop_mask
> 
> and we rely on combine to push the mask into the compare later on.  However,
> with the introduction of cond_vec_cbranch etc. we rely on CSE being able to
> CSE duplicate compares.  CSE however runs before combine and so we need a
> mechanism to expand the compares as predicated from the start.
> 
> This also allows us to simplify some backend patterns.  The proposed optabs
> are
> 
> -- cond_vec_cmp, cond_len_vec_cmp
> 
> which are similar to vec_cmp$a$b in that they take a condition code as
> argument, plus a predicate or len.  This is an optimization and so it's an
> optional optab for targets.  A target would need to implement both vec_cmp
> and cond_vec_cmp to get the optimization.

So like the branch case this needs an else value, just in case this
wasn't obvious.
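
As an illustration of what the else value means here, a scalar model of a
predicated compare might look like the sketch below.  The function name
cond_vec_cmp_lt and its signature are made up for this example; only the
per-lane semantics are the point.

```c
#include <stdbool.h>
#include <stddef.h>

/* Scalar model of a predicated "less than" compare: active lanes hold the
   compare result, inactive lanes take the corresponding else value.  */
void
cond_vec_cmp_lt (bool *res, const bool *mask, const int *b, const int *c,
                 const bool *else_val, size_t nunits)
{
  for (size_t i = 0; i < nunits; i++)
    res[i] = mask[i] ? (b[i] < c[i]) : else_val[i];
}
```

With an all-zero else value this gives the same result as the existing
"a = b cmp c; d = a & loop_mask" sequence, just produced in one step.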

> --- Predicate operations
> 
> The following ones will get new optabs and IFNs and will be generated by the
> vectorizer.  At the moment I don't know how the RVV code deals with similar
> operations.  I will assume that for their case they work on masks/predicates
> as well, as that is the result of the vector compares.  Do tell me if I need
> a LEN version of these, but I would appreciate links to instructions so I
> know how to expect them to be used.
> 
> These operations are used to allow us to not need the scalar epilogue.
> 
> -- Predicate partitioning
> 
> When an early break is vectorized
> 
> if (a `cmp` b)
> 
> Then if we want to avoid the scalar epilogue we would need to handle
> such cases as
> 
> return i
> return a[i]
> 
> etc.
> 
> The first one is return i.  My patches change it such that we now use a scalar
> IV for the forced live IVs.  Which means return i becomes

So that's possibly an additional scalar IV, right?
 
> mask1 = all_before_first (mask)
> scalar_iv + count_active (mask1);

So 'mask' is the mask for the if (a cmp b), aka it indicates whether
we exited for a lane.  So this is ffs (mask) - 1?
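
A small bitmask model of that equivalence, with lane 0 in bit 0 and a fixed
NUNITS of 4 chosen only for the example.  The helper names mirror the ones in
the proposal but are plain C functions here; the behaviour of all_before_first
for an all-false mask (every lane becomes active) is my reading of BRKB and
should be treated as an assumption.  The builtins are the GCC/Clang ones.

```c
#include <assert.h>

#define NUNITS 4
#define ALL_LANES ((1u << NUNITS) - 1)

/* Activate every lane strictly before the first active lane of MASK.
   Assumption: with no active lane, all lanes become active.  */
unsigned
all_before_first (unsigned mask)
{
  mask &= ALL_LANES;
  if (mask == 0)
    return ALL_LANES;
  /* mask & -mask isolates the lowest set bit; subtracting one sets
     all bits below it.  */
  return (mask & -mask) - 1;
}

/* Number of active lanes in MASK.  */
unsigned
count_active (unsigned mask)
{
  return (unsigned) __builtin_popcount (mask & ALL_LANES);
}

/* The ffs formulation: zero-based index of the first active lane.
   Only meaningful when at least one lane is active.  */
unsigned
lanes_before_exit (unsigned mask)
{
  assert ((mask & ALL_LANES) != 0);
  return (unsigned) __builtin_ffs ((int) (mask & ALL_LANES)) - 1;
}
```

For the {0,0,0,1} example the mask is 0x8, all_before_first gives 0x7 and
both formulations yield 3.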

> all_before_first and all_after_first map to the SVE instructions BRKB and
> BRKA.  Essentially all_before_first marks all lanes as active in the
> predicate up to but not including the first active element.
> 
> So a mask of {0,0,0,1} becomes {1,1,1,0}, so that scalar_iv + 3 gives us the
> final value of i.  In SVE the count_active becomes CNTP and x + CNTP
> gets optimized to INCP, so we don't need to model INCP as optabs.
> 
> The mask1 will also be used if the loop had a store, so we can do only the
> partial stores outstanding when we exit the loop.  i.e. store movement would
> now also duplicate the stores in the exit block and mask them by mask1.

So the original scalar IV had an evolution { init, +, step }, the
"vector" IV will have { init, +, VF * step } then, right?  I wonder
whether in case we have an actual vector IV for the IV (because of
other uses), the epilog could use i == vector-i[0], so extract
element zero from a vector IV to compute the scalar i?
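
To spell out the relationship I have in mind, a scalar model with a fixed VF
of 4.  All names are hypothetical, and scaling the lane count by step in
final_i is my generalization of the "scalar_iv + count_active (mask1)" form,
which corresponds to step 1.

```c
#include <stddef.h>

#define VF 4

/* Vector IV for vector iteration ITER: lane k holds the value the scalar
   i would have in scalar iteration ITER * VF + k.  */
void
vector_iv (long *lanes, long init, long step, size_t iter)
{
  for (size_t k = 0; k < VF; k++)
    lanes[k] = init + (long) (iter * VF + k) * step;
}

/* Scalar IV with evolution { init, +, VF * step }.  */
long
scalar_iv (long init, long step, size_t iter)
{
  return init + (long) (iter * VF) * step;
}

/* Final value of i when the exit is taken with LANES_BEFORE_EXIT lanes
   preceding the first exiting lane.  */
long
final_i (long iv, long step, unsigned lanes_before_exit)
{
  return iv + (long) lanes_before_exit * step;
}
```

In this model element zero of the vector IV and the scalar IV are the same
value in every vector iteration, so either can seed the final computation.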

> return a[i]
> 
> is handled by using extract_first_active, in SVE these map to the COMPACT or
> SPLICE instruction depending on the size of the lanes. The usage would be

So AVX512 has vcompressp{d,s} and vexpandp{d,s} (but nothing for smaller
integer element types).  Those could be used for this but they have
a vector result (and element zero would be the first active).
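
For reference, a scalar model of the compress operation as I use the term
here.  The function name is made up, and filling the unused tail with zero is
an assumption of the model (it corresponds to a zero-masking form; a merging
form would leave those lanes alone).

```c
#include <stdbool.h>
#include <stddef.h>

/* Pack the active lanes of SRC contiguously into the low lanes of DST,
   zero the remaining lanes, and return the number of active lanes.  */
size_t
compress (double *dst, const double *src, const bool *mask, size_t nunits)
{
  size_t n = 0;

  for (size_t i = 0; i < nunits; i++)
    if (mask[i])
      dst[n++] = src[i];
  for (size_t i = n; i < nunits; i++)
    dst[i] = 0.0;
  return n;
}
```

After the compress, the first active element is simply lane zero of the
result, provided at least one lane was active.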

But don't you possibly want the last inactive as well, dependent on
whether this is a peeled/not peeled exit?  We could shift the mask
by one for either case.

vcompresspd is not available on AVX2 or SSE4.1; using a vector-vector
permute to get the element 'i' % nunits to lane zero would be another
possibility.  Also, for elements that are not float or double sized we need
something like this.
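
A sketch of the permute variant, again as a scalar model with a fixed NUNITS
of 8 picked for the example.  The names are hypothetical, and using a rotate
as the selector is just one possible choice; only lane zero of the result
matters for the extraction.

```c
#include <stddef.h>

#define NUNITS 8

/* Generic vector-vector permute: lane k of DST takes lane SEL[k] of SRC.  */
void
permute (int *dst, const int *src, const unsigned *sel)
{
  for (size_t k = 0; k < NUNITS; k++)
    dst[k] = src[sel[k] % NUNITS];
}

/* Bring the element for scalar index I to lane zero by rotating the
   vector left by I % NUNITS lanes.  */
void
move_lane_to_zero (int *dst, const int *src, size_t i)
{
  unsigned sel[NUNITS];

  for (size_t k = 0; k < NUNITS; k++)
    sel[k] = (unsigned) ((i + k) % NUNITS);
  permute (dst, src, sel);
}
```

The selector depends on the runtime value of i, so on a real target this
would have to be a variable permute rather than a constant one.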

I do wonder whether we want to have the compress/expand as actual
optabs when we use them.  Having an extract_first (without _active,
following extract_last_optab) is probably OK to abstract this to
some extent.  extract_last doesn't specify what happens if no
bit is set in the mask, fold_extract_last seems to be the same
but with an else value - I wonder whether we should canonicalize
those and thus have an else value for extract_first.
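
What I mean by an extract_first with an else value, as a scalar model.  The
signature is hypothetical and not that of any existing optab or IFN.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return the element of VEC in the first active lane of MASK, or
   ELSE_VAL when no lane is active.  */
int
extract_first (const int *vec, const bool *mask, size_t nunits, int else_val)
{
  for (size_t i = 0; i < nunits; i++)
    if (mask[i])
      return vec[i];
  return else_val;
}
```

With the else value in place the all-false case is defined, at the cost of
targets having to materialize it where the instruction does not provide one.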

> extract_first_active (vect_a, mask)
> 
> Note that for this IFN it is undefined what would happen if mask is all false.
> This is required to match the SVE semantics of the operations but it's
> guaranteed that the vectorizer only calls the IFN when at least one lane is
> active.
> 
> I do not believe I need a LEN version here either?  But if I'm wrong it would
> be useful to have a small example.

I think you need a len variant unless the mask producer had len
applied with an else value of 0 (IIRC RVV always prefers 'undefined'
as else value).  OTOH the "first" element - if one is set and we never
require 'else' - should work with or without loop masking (with len or
mask).
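
A small model of the difference, assuming that lanes at or beyond len may
hold arbitrary values in the mask (the 'undefined' else case).  Both function
names are made up for the example; they return the lane index, or -1 if none
is found.

```c
#include <stdbool.h>
#include <stddef.h>

/* First active lane looking at the mask alone.  */
long
first_active_mask_only (const bool *mask, size_t nunits)
{
  for (size_t i = 0; i < nunits; i++)
    if (mask[i])
      return (long) i;
  return -1;
}

/* First active lane among the first LEN lanes only; lanes at or beyond
   LEN are ignored whatever the mask holds there.  */
long
first_active_with_len (const bool *mask, size_t len, size_t nunits)
{
  size_t limit = len < nunits ? len : nunits;

  for (size_t i = 0; i < limit; i++)
    if (mask[i])
      return (long) i;
  return -1;
}
```

When some lane below len is set the two agree, which is the case where no len
variant is needed; they only differ when the sole set lanes are in the
undefined tail.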

That said, I do wonder why we have both extract_last and 
fold_extract_last.

Richard.

> 
> Thanks,
> Tamar
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
