On Thu, 11 Sep 2025, Tamar Christina wrote:

> This email is to discuss the optabs and IFNs that I will need to finish off
> the early break optimizations in GCC 16 and to minimize respins on the final
> patches.
>
> Optabs only.  These will be handled through expand rewriting them if the
> target supports them:
>
> -- cbranch
>
> At the moment we model vector compare and branch using cbranch on vector
> arguments.  e.g.
>
>   a = b `cmp` c
>   if (a <cmp2> 0)
>     ...
>
> where <cmp2> is != or == reflecting whether we want ANY or ALL.
>
> The plan is to deprecate this in favor of an explicit vector compare, which
> allows us to elide the != 0 and == 0.  This allows us to generate more
> efficient code in the backends.  For instance for Adv. SIMD we can now use
> an SVE compare + branch for Adv. SIMD.
>
> For this feature 6 new optabs are needed, and cbranch will be made
> unsupported for vector arguments and I will update all backends.
>
> -- vec_cbranch_any, vec_cbranch_all
>
> These are unpredicated versions of the compares.  This means this version
> also does not have an ELSE argument as all lanes are relevant.
>
>   cond_vec_cbranch_any, cond_vec_cbranch_all
>   cond_len_vec_cbranch_any, cond_len_vec_cbranch_all
>
> These are the predicated equivalents of the above.  These take the
> predicate/len of the operation and also take an ELSE argument for what
> happens when the predicate is inactive.  I think the default is zero.  But
> I'm unsure how to handle when a target wants something other than zero since
> this is an expand time rewriting and so I can't reject the expansion.

You are rewriting from a GIMPLE_COND, right?  If so the vectorizer should
have verified an appropriate expansion exists (I suggest separating out a
function that selects the optab, which you can call from both expansion and
vectorization for this).

> -- vec_cmp
>
> At the moment vector compares are always unpredicated.  Masking or LENs are
> applied to the result of the unpredicated compare.
>
> E.g.
>
>   a = b `cmp` c
>   d = a & loop_mask
>
> and we rely on combine to push the mask into the compare later on.  However,
> with the introduction of cond_vec_cbranch etc. we rely on CSE being able to
> CSE duplicate compares.  CSE however runs before combine and so we need a
> mechanism to expand the compares as predicated from the start.
>
> This also allows us to simplify some backend patterns.  The proposed optabs
> are
>
> -- cond_vec_cmp, cond_len_vec_cmp
>
> which are similar to vec_cmp$a$b in that they take a condition code as
> argument and a predicate or len.  This is an optimization and so it's an
> optional optab for targets.  A target would need to implement both vec_cmp
> and cond_vec_cmp to get the optimization.

So like the branch case this needs an else value, just in case this wasn't
obvious.

> --- Predicate operations
>
> The following ones will get new optabs and IFN and will be generated by the
> vectorizer.  At the moment I don't know how the RVV code deals with similar
> operations.  I will assume that for their case they work on
> masks/predicates as well as that is the result of the vector compares.  Do
> tell me if I need a LEN version of these, but I would appreciate links to
> instructions so I know how to expect them to be used.
>
> These operations are used to allow us to not need the scalar epilogue.
>
> -- Predicate partitioning
>
> When an early break is vectorized
>
>   if (a `cmp` b)
>
> Then if we want to avoid the scalar epilogue we would need to handle such
> cases as
>
>   return i
>   return a[i]
>
> etc.
>
> The first one is return i.
> My patches change it such that we now use a scalar IV for the forced live
> IVs.  Which means return i becomes

So that's possibly an additional scalar IV, right?

>   mask1 = all_before_first (mask)
>   scalar_iv + count_active (mask1);

So 'mask' is the mask for the if (a cmp b), aka it indicates whether we
exited for a lane.  So this is ffs (mask) - 1?

> all_before_first and all_after_first map to the SVE instructions BRKB and
> BRKA.  Essentially all_before_first marks all lanes as active in the
> predicate up to but not including the first active element.
>
> So a mask of {0,0,0,1} becomes {1,1,1,0} such that we get scalar_iv + 3,
> which gives us the final value of i.  In SVE the count_active becomes CNTP
> and x + CNTP gets optimized to INCP, so we don't need to model INCP as
> optabs.
>
> The mask1 will also be used if the loop had a store, so we can only do the
> partial stores outstanding when we exit the loop.  i.e. store movement would
> now also duplicate the stores in the exit block and mask them by mask1.

So the original scalar IV had an evolution { init, +, step }, the "vector"
IV will have { init, +, VF * step } then, right?  I wonder whether in case
we have an actual vector IV for the IV (because of other uses), the epilog
could use i == vector-i[0], so extract element zero from a vector IV to
compute the scalar i?

>   return a[i]
>
> is handled by using extract_first_active, in SVE these map to the COMPACT or
> SPLICE instruction depending on the size of the lanes.  The usage would be

So AVX512 has vcompressp{d,s} and vexpandp{d,s} (but nothing for smaller
integer element types).  Those could be used for this but they have a vector
result (and element zero would be the first active).  But don't you possibly
want the last inactive as well, dependent on whether this is a peeled/not
peeled exit?  We could shift the mask by one for either case.
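
To make the relation between the quoted all_before_first / count_active
sequence and ffs (mask) - 1 concrete, here is a minimal scalar sketch in C.
It assumes a 4-lane mask stored with lane 0 in bit 0, and it assumes that an
all-false mask makes all_before_first return an all-true mask (the thread
does not spell that case out).  The function names follow the IFN names
proposed above, but the bodies, NLANES and ffs_model are purely illustrative
and are not GCC code.

```c
#include <assert.h>
#include <stdint.h>

#define NLANES 4u                      /* hypothetical vector length */
#define ALL_LANES ((1u << NLANES) - 1u)

/* Lanes strictly before the first active lane become active.
   Assumption: with no active lane, every lane becomes active.  */
static uint32_t all_before_first(uint32_t mask)
{
    mask &= ALL_LANES;
    if (mask == 0)
        return ALL_LANES;
    uint32_t lowest = mask & (~mask + 1u);   /* isolate first active lane */
    return (lowest - 1u) & ALL_LANES;
}

/* Number of active lanes in the mask.  */
static int count_active(uint32_t mask)
{
    int n = 0;
    for (uint32_t i = 0; i < NLANES; i++)
        n += (mask >> i) & 1u;
    return n;
}

/* Same contract as ffs: 1-based index of the first active lane, 0 if none.  */
static int ffs_model(uint32_t mask)
{
    for (uint32_t i = 0; i < NLANES; i++)
        if ((mask >> i) & 1u)
            return (int)i + 1;
    return 0;
}

/* Element of the first active lane; undefined for an all-false mask,
   which this sketch turns into an assertion failure.  */
static int extract_first_active(const int *vec, uint32_t mask)
{
    assert((mask & ALL_LANES) != 0);
    return vec[ffs_model(mask) - 1];
}
```

With the mask {0,0,0,1} from the quoted example (0x8 in this encoding),
all_before_first gives 0x7, count_active of that is 3, and ffs_model (0x8) - 1
is also 3.  The two forms only agree when at least one lane is active, which
matches the guarantee stated for extract_first_active.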

vcompresspd is not available on AVX2 or SSE4.1; using a vector-vector
permute to get the element 'i' % nunits to lane zero would be another
possibility.  Also for non-float or double sized elements we need something
like this.  I do wonder whether we want to have the compress/expand as
actual optabs when we use them.  Having an extract_first (without _active,
following extract_last_optab) is probably OK to abstract this to some
extent.  extract_last doesn't specify what happens if no bit is set in the
mask, fold_extract_last seems to be the same but with an else value - I
wonder whether we should canonicalize those and thus have an else value for
extract_first.

>   extract_first_active (vect_a, mask)
>
> Note that for this IFN it is undefined what would happen if mask is all
> false.  This is required to match the SVE semantics of the operations but
> it's guaranteed the vectorizer only calls the IFN when at least one lane is
> active.
>
> I do not believe I need a LEN version here either?  But if I'm wrong it
> would be useful to have a small example.

I think you need a len variant unless the mask producer had len applied with
an else value of 0 (IIRC RVV always prefers 'undefined' as else value).
OTOH the "first" element - if one is set and we never require 'else' -
should work with or without loop masking (with len or mask).

That said, I do wonder why we have both extract_last and fold_extract_last.

Richard.

>
> Thanks,
> Tamar
>

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)