This email is to discuss the optabs and IFNs that I will need to finish off the
early break optimizations in GCC 16 and to minimize respins on the final 
patches.

Optabs only.  These will be handled at expand time, which rewrites to them if
the target supports them:

-- cbranch

At the moment we model vector compare and branch using cbranch on vector
arguments. e.g.

a = b `cmp` c
if (a <cmp2> 0)
  ...

where <cmp2> is != or == reflecting whether we want ANY or ALL.

The plan is to deprecate this in favor of an explicit vector compare, which
allows us to elide the != 0 and == 0 and so generate more efficient code in the
backends.  For instance, for Adv. SIMD we can now use an SVE compare + branch.

For this feature 6 new optabs are needed.  cbranch will be made unsupported
for vector arguments, and I will update all backends.

-- vec_cbranch_any, vec_cbranch_all

These are the unpredicated versions of the compares.  This means they do not
have an ELSE argument, as all lanes are relevant.
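
As a rough scalar model of what the two unpredicated optabs are meant to
compute (a sketch only: the function names are made up for illustration, `cmp`
is fixed to `<`, and "all" is assumed to mean every lane compares true):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scalar model of vec_cbranch_any: the branch is taken when
   at least one lane of b `cmp` c is true.  `cmp` is fixed to '<' here.  */
bool
model_cbranch_any_lt (const int *b, const int *c, size_t nlanes)
{
  for (size_t i = 0; i < nlanes; i++)
    if (b[i] < c[i])
      return true;
  return false;
}

/* Hypothetical scalar model of vec_cbranch_all: the branch is taken only
   when every lane of b `cmp` c is true (assumed semantics).  */
bool
model_cbranch_all_lt (const int *b, const int *c, size_t nlanes)
{
  for (size_t i = 0; i < nlanes; i++)
    if (!(b[i] < c[i]))
      return false;
  return true;
}

/* Small self-check of both models on a 4-lane example.  */
void
check_cbranch_models (void)
{
  const int b[4] = { 1, 5, 3, 9 };
  const int c[4] = { 2, 4, 4, 8 };
  assert (model_cbranch_any_lt (b, c, 4));   /* lanes 0 and 2 are true  */
  assert (!model_cbranch_all_lt (b, c, 4));  /* lanes 1 and 3 are false */
}
```

Neither model materializes the intermediate mask a, which is the point of
dropping the explicit != 0 / == 0 test.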

cond_vec_cbranch_any, cond_vec_cbranch_all
cond_len_vec_cbranch_any, cond_len_vec_cbranch_all

These are the predicated equivalents of the above.  They take the predicate/len
of the operation and also take an ELSE argument for what happens when the
predicate is inactive.  I think the default is zero, but I'm unsure how to
handle the case where a target wants something other than zero, since this is
an expand time rewriting and so I can't reject the expansion.

-- vec_cmp

At the moment vector compares are always unpredicated. Masking or LENs are
applied to the result of the unpredicated compare.

E.g.

a = b `cmp` c
d = a & loop_mask

and we rely on combine to push the mask into the compare later on.  However,
with the introduction of cond_vec_cbranch etc. we rely on CSE being able to CSE
duplicate compares.  CSE however runs before combine, and so we need a
mechanism to expand the compares as predicated from the start.

This also allows us to simplify some backend patterns.  The proposed optabs are

-- cond_vec_cmp, cond_len_vec_cmp

which are similar to vec_cmp$a$b in that they take a condition code as an
argument, plus a predicate or len.  This is an optimization, so these optabs
are optional for targets.  A target would need to implement both vec_cmp and
cond_vec_cmp to get the optimization.
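
To make the intended equivalence concrete, here is a scalar sketch comparing
today's compare-then-mask form with a predicated compare.  The function names
are hypothetical, `cmp` is fixed to `<`, and inactive lanes are assumed to
produce false, matching what `a & loop_mask` gives today:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Today's form: unpredicated compare, then AND with the loop mask.
     a = b `cmp` c
     d = a & loop_mask  */
void
model_cmp_then_mask (const int *b, const int *c, const bool *loop_mask,
                     bool *d, size_t nlanes)
{
  for (size_t i = 0; i < nlanes; i++)
    {
      bool a = b[i] < c[i];
      d[i] = a && loop_mask[i];
    }
}

/* Hypothetical model of cond_vec_cmp: the mask is an operand of the
   compare itself.  Inactive lanes yield false (an assumption).  */
void
model_cond_vec_cmp_lt (const int *b, const int *c, const bool *loop_mask,
                       bool *d, size_t nlanes)
{
  for (size_t i = 0; i < nlanes; i++)
    d[i] = loop_mask[i] ? (b[i] < c[i]) : false;
}

/* Both forms must agree lane by lane.  */
void
check_cond_vec_cmp_model (void)
{
  const int b[4] = { 1, 5, 3, 0 };
  const int c[4] = { 2, 4, 4, 8 };
  const bool loop_mask[4] = { true, true, true, false };
  bool d1[4], d2[4];
  model_cmp_then_mask (b, c, loop_mask, d1, 4);
  model_cond_vec_cmp_lt (b, c, loop_mask, d2, 4);
  for (size_t i = 0; i < 4; i++)
    assert (d1[i] == d2[i]);
}
```

Since both forms produce the same lanes, expanding the predicated form directly
should only change when the mask gets attached, not the result.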

--- Predicate operations

The following ones will get new optabs and IFNs and will be generated by the
vectorizer.  At the moment I don't know how the RVV code deals with similar
operations.  I will assume that in their case they work on masks/predicates as
well, as that is the result of the vector compares.  Do tell me if I need a LEN
version of these, but I would appreciate links to instructions so I know how to
expect them to be used.

These operations are used to allow us to not need the scalar epilogue.

-- Predicate partitioning

When an early break is vectorized

if (a `cmp` b)

Then if we want to avoid the scalar epilogue we would need to handle
such cases as

return i
return a[i]

etc.

The first one is return i.  My patches change it such that we now use a scalar
IV for the forced live IVs, which means return i becomes

mask1 = all_before_first (mask)
scalar_iv + count_active (mask1);

all_before_first and all_after_first map to the SVE instructions BRKB and BRKA
respectively.  Essentially, all_before_first marks all lanes as active in the
predicate up to, but not including, the first active element.

So a mask of {0,0,0,1} becomes {1,1,1,0}, such that we get scalar_iv + 3, which
gives us the final value of i.  In SVE the count_active becomes CNTP and
x + CNTP gets optimized to INCP, so we don't need to model INCP as optabs.
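
A scalar sketch of the two operations on the example above.  The function
names are made up for illustration, and the model ignores any governing
predicate; with an all-false mask it sets every lane, which is an assumption
about the intended behaviour:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of all_before_first: lanes strictly before the
   first active lane of MASK become active, everything else inactive.  */
void
model_all_before_first (const bool *mask, bool *out, size_t nlanes)
{
  bool seen = false;
  for (size_t i = 0; i < nlanes; i++)
    {
      seen = seen || mask[i];
      out[i] = !seen;
    }
}

/* Hypothetical model of count_active: number of active lanes.  */
size_t
model_count_active (const bool *mask, size_t nlanes)
{
  size_t count = 0;
  for (size_t i = 0; i < nlanes; i++)
    count += mask[i] ? 1 : 0;
  return count;
}

/* The {0,0,0,1} example: mask1 is {1,1,1,0}, so i = scalar_iv + 3.  */
void
check_partition_model (void)
{
  const bool mask[4] = { false, false, false, true };
  bool mask1[4];
  size_t scalar_iv = 8;  /* arbitrary value at the start of the iteration */
  model_all_before_first (mask, mask1, 4);
  assert (scalar_iv + model_count_active (mask1, 4) == 11);
}
```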

The mask1 will also be used if the loop had a store, so that we do only the
partial stores outstanding when we exit the loop.  i.e. store movement would
now also duplicate the stores in the exit block and mask them by mask1.

return a[i]

is handled by using extract_first_active.  In SVE this maps to the COMPACT or
SPLICE instruction depending on the size of the lanes.  The usage would be

extract_first_active (vect_a, mask)

Note that for this IFN it is undefined what happens if mask is all false.
This is required to match the SVE semantics of the operations, but it's
guaranteed that the vectorizer only calls the IFN when at least one lane is
active.
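
A scalar sketch of the intended semantics.  The function name is hypothetical,
and the all-false case, which the IFN leaves undefined, is modelled here as an
assertion failure:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of extract_first_active (vect_a, mask): return the
   element of VECT_A in the first lane where MASK is active.  The caller
   must guarantee at least one active lane.  */
int
model_extract_first_active (const int *vect_a, const bool *mask,
                            size_t nlanes)
{
  for (size_t i = 0; i < nlanes; i++)
    if (mask[i])
      return vect_a[i];
  /* Undefined for the IFN; the model traps instead.  */
  assert (!"mask must have at least one active lane");
  return 0;
}

/* return a[i] where the early break triggers in lane 2.  */
void
check_extract_model (void)
{
  const int vect_a[4] = { 10, 20, 30, 40 };
  const bool mask[4] = { false, false, true, true };
  assert (model_extract_first_active (vect_a, mask, 4) == 30);
}
```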

I do not believe I need a LEN version here either.  But if I'm wrong it would
be useful to have a small example.

Thanks,
Tamar
