Richard Biener <rguent...@suse.de> writes:
> On Fri, 16 May 2025, Richard Sandiford wrote:
>> > The simple prototype below uses a separate flag from the epilogue
>> > mode, but I wonder how we want to more generally want to handle
>> > whether to use masking or not when iterating over modes. Currently
>> > we mostly rely on --param vect-partial-vector-usage. aarch64
>> > and riscv have both variable-length modes but also fixed-size modes
>> > where for the latter, like on x86, the target couldn't request
>> > a mode specifically with or without masking. It seems both
>> > aarch64 and riscv fully rely on cost comparison and fully
>> > exploiting the mode iteration space (but not masked vs. non-masked?!)
>> > here?
>> >
>> > I was thinking of adding a vectorization_mode class that would
>> > encapsulate the mode and whether to allow masking or alternatively
>> > to make the vector_modes array (and the m_suggested_epilogue_mode)
>> > a std::pair of mode and mask flag?
>>
>> Predicated vs. non-predicated SVE is interesting for the main loop.
>> The class sounds like it would be useful for that.
>>
>> I suppose predicated vs. non-predicated SVE is also potentially
>> interesting for an unrolled epilogue, although there, it would in
>> theory be better to predicate only the last vector iteration
>> (i.e. part predicated, part unpredicated).
>
> Yes, the latter is what we want for AVX512, keep the main loop
> not predicated but have the epilog predicated (using the same VF).

Reading it back, what I said was very ambiguous (as usual, unfortunately).
What I actually meant was that if we had, say, a 4x unrolled main loop
and a 2x unrolled first epilogue loop, we'd in theory want the 2x unrolled
epilogue loop to use unpredicated operations for the first VF/2 elements
and predicated operations for the second VF/2 elements. That way, we get
the benefit of the 2x unrolling for residues of >VF elements, but skip
to a second epilogue if there are VF or fewer remaining elements.

That example assumes that the last quarter of each iteration of the
main loop is predicated in a similar way, with the rest of the iteration
being unpredicated. Alternatively, we could have a fully-unpredicated
2x unrolled main loop followed by the same kind of semi-predicated
2x unrolled epilogue loop. So if U == unpredicated and P == predicated:

  main loop: U U U P
  1st epilogue loop: U P
  2nd epilogue loop: P

  1st and 2nd epilogues might both be used

or:

  main loop: U U
  1st epilogue loop: U P
  2nd epilogue loop: P

  1st and 2nd epilogues are mutually exclusive

although the epilogues don't need to loop in either case.

>> So I suppose unpredicated SVE epilogue loops might be interesting
>> until that partial predication is implemented, but I'm not sure how
>> useful unpredicated SVE epilogue loops would be "once" the partial
>> predication is supported.
>>
>> I don't imagine we'll often know a priori for AArch64 which type
>> of vector epilogue is best. Since switching between SVE and
>> Advanced SIMD is assumed to be essentially free, I think we'll
>> still rely on the current approach of costing both and seeing
>> which is cheaper.
>
> So the other case we might run into on x86 is if you have a
> known loop tripcount but fully vectorizing the epilogue is
> still not possible because while we have half-SSE, like V8QImode,
> we don't have V4QI or V2QI, so even with multiple epilogues
> we'd still end up with an iterating scalar epilog.
> Those cases might be good candidates for a predicated epilog as well.
> So in the end we'd prefer branchless epilogues.

Yeah, branchless is also the aim of the schemes above.

> Predication on x86 is quite a bit more expensive so I don't see
> us using a predicated main vector loop anytime soon, and I'd
> expect that to be the case for all archs when using a fixed-size
> mode? Is that the case for -msve-vector-bits=X as well? Is
> there an advantage for not using a predicated main vector loop?

I think it depends on the size of the loop. I've seen large HPC loops
for which the overhead of predication and loop control is subsumed
by the inherent complexity of the work, and duplicating the whole thing
would probably be counterproductive. But yeah, for tighter loops,
SVE should benefit from unpredicated main loops too.

Thanks,
Richard
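
For reference, the two U/P schemes described earlier can be modelled with a small Python sketch that reports which loops would run for a given element count. This is only one reading of the scheme, not vectorizer code: the function name `schedule`, its parameters, and the choice of entry thresholds (a main loop with a predicated tail is entered only when its last vector has at least one active element) are all assumptions made for illustration.

```python
def schedule(n, v, main_unroll, main_tail_predicated):
    """Model which loops run for n elements (illustrative sketch only).

    v is the number of elements in one vector; main_unroll is the unroll
    factor of the main loop. All names here are hypothetical.
    Returns (main_iterations, used_epilogue1, used_epilogue2).
    """
    step = main_unroll * v
    if main_tail_predicated:
        # "U U U P": the last vector is predicated, so one iteration can
        # absorb anything from (unroll - 1) * v + 1 up to unroll * v elements.
        threshold = (main_unroll - 1) * v + 1
    else:
        # "U U": all vectors are unpredicated, so a full step is required.
        threshold = step

    remaining = n
    main_iters = 0
    while remaining >= threshold:
        remaining -= min(remaining, step)
        main_iters += 1

    # 1st epilogue "U P": one full vector plus one predicated vector.
    # It runs at most once, and only if more than one vector is left.
    used1 = remaining > v
    if used1:
        remaining -= min(remaining, 2 * v)

    # 2nd epilogue "P": a single predicated vector for whatever is left.
    used2 = remaining > 0
    return main_iters, used1, used2


if __name__ == "__main__":
    v = 4
    for n in (3, 7, 12, 13):
        print(n, schedule(n, v, 4, True), schedule(n, v, 2, False))
```

Under these assumptions, a residue of between 2 and 3 vectors in the first scheme passes through both epilogues, while in the second scheme the residue after the main loop is always less than 2 vectors, so at most one epilogue runs. Neither epilogue iterates in either case.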