This email is to discuss the optabs and IFNs that I will need to finish off the early break optimizations in GCC 16 and to minimize respins on the final patches.
Optabs only.  These will be handled through expand, rewriting them if the target supports them:

-- cbranch

At the moment we model vector compare and branch using cbranch on vector arguments, e.g.

  a = b `cmp` c
  if (a <cmp2> 0)
    ...

where <cmp2> is != or ==, reflecting whether we want ANY or ALL.

The plan is to deprecate this in favor of an explicit vector compare, which allows us to elide the != 0 and == 0.  This lets us generate more efficient code in the backends.  For instance, for Adv. SIMD we can now use an SVE compare + branch.

For this feature 6 new optabs are needed, cbranch will be made unsupported for vector arguments, and I will update all backends.

-- vec_cbranch_any, vec_cbranch_all

These are the unpredicated versions of the compares.  This means they do not have an ELSE argument, as all lanes are relevant.

cond_vec_cbranch_any, cond_vec_cbranch_all
cond_len_vec_cbranch_any, cond_len_vec_cbranch_all

These are the predicated equivalents of the above.  They take the predicate/len of the operation and also take an ELSE argument for what happens when the predicate is inactive.  I think the default is zero, but I'm unsure how to handle a target that wants something other than zero: this is an expand-time rewriting, so I can't reject the expansion.

-- vec_cmp

At the moment vector compares are always unpredicated.  Masking or LENs are applied to the result of the unpredicated compare, e.g.

  a = b `cmp` c
  d = a & loop_mask

and we rely on combine to push the mask into the compare later on.  However, with the introduction of cond_vec_cbranch etc. we rely on CSE being able to CSE duplicate compares.  CSE runs before combine, so we need a mechanism to expand the compares as predicated from the start.  This also allows us to simplify some backend patterns.

The proposed optabs are

-- cond_vec_cmp, cond_len_vec_cmp

which are similar to vec_cmp$a$b in that they take a condition code as argument, plus a predicate or len.
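
To make the intended semantics concrete, here is a lane-by-lane scalar model of the compare optabs discussed above.  It is only a sketch: the enum, the function names and the int/bool lane types are made up for illustration and are not GCC interfaces.  It assumes that ANY means at least one lane compares true, that ALL means every lane compares true, and that an inactive lane of a predicated compare produces false (which is what the `a & loop_mask` form gives today).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical condition codes; the real optabs take a comparison code
   as an operand.  */
enum cmp_code { CMP_EQ, CMP_NE, CMP_LT, CMP_GT };

static bool
compare_lane (enum cmp_code code, int b, int c)
{
  switch (code)
    {
    case CMP_EQ: return b == c;
    case CMP_NE: return b != c;
    case CMP_LT: return b < c;
    case CMP_GT: return b > c;
    }
  assert (0 && "unknown condition code");
  return false;
}

/* Model of the current sequence: a = b `cmp` c; d = a & loop_mask.  */
static void
vec_cmp_then_mask (enum cmp_code code, const int *b, const int *c,
                   const bool *loop_mask, bool *d, size_t nlanes)
{
  for (size_t i = 0; i < nlanes; i++)
    {
      bool a = compare_lane (code, b[i], c[i]);
      d[i] = a && loop_mask[i];
    }
}

/* Model of a predicated compare.  Assumption: inactive lanes are not
   compared at all and yield false.  */
static void
cond_vec_cmp_model (enum cmp_code code, const int *b, const int *c,
                    const bool *mask, bool *d, size_t nlanes)
{
  for (size_t i = 0; i < nlanes; i++)
    d[i] = mask[i] ? compare_lane (code, b[i], c[i]) : false;
}

/* Assumed meaning of ANY: branch if at least one lane compares true.  */
static bool
vec_cbranch_any_model (enum cmp_code code, const int *b, const int *c,
                       size_t nlanes)
{
  for (size_t i = 0; i < nlanes; i++)
    if (compare_lane (code, b[i], c[i]))
      return true;
  return false;
}

/* Assumed meaning of ALL: branch only if every lane compares true.  */
static bool
vec_cbranch_all_model (enum cmp_code code, const int *b, const int *c,
                       size_t nlanes)
{
  for (size_t i = 0; i < nlanes; i++)
    if (!compare_lane (code, b[i], c[i]))
      return false;
  return true;
}
```

Under those assumptions vec_cmp_then_mask and cond_vec_cmp_model produce the same result for every input; the difference is only that the predicated form is explicit from expand onwards, so duplicate compares look identical to CSE.
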
cond_vec_cmp and cond_len_vec_cmp are an optimization, so they are optional optabs for targets.  A target would need to implement both vec_cmp and cond_vec_cmp to get the optimization.

---

Predicate operations

The following will get new optabs and IFNs and will be generated by the vectorizer.  At the moment I don't know how the RVV code deals with similar operations.  I will assume that in their case they work on masks/predicates as well, since that is the result of the vector compares.  Do tell me if I need a LEN version of these, but I would appreciate links to the instructions so I know how to expect them to be used.

These operations allow us to avoid needing the scalar epilogue.

-- Predicate partitioning

When an early break is vectorized

  if (a `cmp` b)

then if we want to avoid the scalar epilogue we need to handle cases such as

  return i
  return a[i]

etc.

The first one is return i.  My patches change it such that we now use a scalar IV for the forced live IVs, which means return i becomes

  mask1 = all_before_first (mask)
  scalar_iv + count_active (mask1);

all_before_first and all_after_first map to the SVE instructions BRKB and BRKA.  Essentially, all_before_first makes all lanes in the predicate active up to, but not including, the first active element.  So a mask of {0,0,0,1} becomes {1,1,1,0}, such that the result is scalar_iv + 3, which gives us the final value of i.

In SVE the count_active becomes CNTP, and x + CNTP gets optimized to INCP, so we don't need to model INCP as an optab.

mask1 will also be used if the loop had a store, so that we perform only the partial stores still outstanding when we exit the loop, i.e. store movement would now also duplicate the stores in the exit block and mask them by mask1.

return a[i] is handled by using extract_first_active; in SVE this maps to the COMPACT or SPLICE instruction, depending on the size of the lanes.  The usage would be

  extract_first_active (vect_a, mask)

Note that for this IFN it is undefined what happens if mask is all false.
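
As a sanity check of the partitioning operations above, here is a scalar model of all_before_first, count_active and extract_first_active.  Again this is only a sketch: the _model function names, MAX_LANES and the int/bool lane types are invented for illustration, and the model ignores any governing predicate that a real SVE BRKB would also take.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical upper bound on the number of lanes, for the temporary
   mask only.  */
#define MAX_LANES 64

/* all_before_first: activate every lane up to, but not including, the
   first active lane of MASK.  */
static void
all_before_first_model (const bool *mask, bool *out, size_t nlanes)
{
  bool seen_active = false;
  for (size_t i = 0; i < nlanes; i++)
    {
      seen_active = seen_active || mask[i];
      out[i] = !seen_active;
    }
}

/* count_active: number of active lanes in MASK.  */
static size_t
count_active_model (const bool *mask, size_t nlanes)
{
  size_t count = 0;
  for (size_t i = 0; i < nlanes; i++)
    count += mask[i] ? 1 : 0;
  return count;
}

/* "return i": scalar_iv + count_active (all_before_first (mask)).  */
static size_t
final_iv_model (size_t scalar_iv, const bool *mask, size_t nlanes)
{
  bool mask1[MAX_LANES];
  assert (nlanes <= MAX_LANES);
  all_before_first_model (mask, mask1, nlanes);
  return scalar_iv + count_active_model (mask1, nlanes);
}

/* "return a[i]": the element of VECT_A in the first active lane.  The
   IFN leaves an all-false mask undefined; this model treats it as a
   caller bug.  */
static int
extract_first_active_model (const int *vect_a, const bool *mask,
                            size_t nlanes)
{
  for (size_t i = 0; i < nlanes; i++)
    if (mask[i])
      return vect_a[i];
  assert (0 && "extract_first_active called with an all-false mask");
  return 0;
}
```

With mask = {0,0,0,1} and scalar_iv = 16, final_iv_model gives 16 + 3 = 19, and extract_first_active_model on {10,20,30,40} gives 40, i.e. a[19] if the vector was loaded from a[16..19].
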
Leaving the all-false case of extract_first_active undefined is required to match the SVE semantics of the operations, but it is guaranteed that the vectorizer only calls the IFN when at least one lane is active.

I do not believe I need a LEN version here either?  But if I'm wrong, it would be useful to have a small example.

Thanks,
Tamar