Richard Biener <rguent...@suse.de> writes:
> On Mon, 19 May 2025, Richard Sandiford wrote:
>
>> Richard Biener <rguent...@suse.de> writes:
>>> On Fri, 16 May 2025, Richard Sandiford wrote:
>>>>> The simple prototype below uses a separate flag from the epilogue
>>>>> mode, but I wonder how we want to more generally handle
>>>>> whether to use masking or not when iterating over modes. Currently
>>>>> we mostly rely on --param vect-partial-vector-usage. aarch64
>>>>> and riscv have both variable-length modes but also fixed-size modes
>>>>> where for the latter, like on x86, the target couldn't request
>>>>> a mode specifically with or without masking. It seems both
>>>>> aarch64 and riscv fully rely on cost comparison and fully
>>>>> exploiting the mode iteration space (but not masked vs. non-masked?!)
>>>>> here?
>>>>>
>>>>> I was thinking of adding a vectorization_mode class that would
>>>>> encapsulate the mode and whether to allow masking or alternatively
>>>>> to make the vector_modes array (and the m_suggested_epilogue_mode)
>>>>> a std::pair of mode and mask flag?
>>>>
>>>> Predicated vs. non-predicated SVE is interesting for the main loop.
>>>> The class sounds like it would be useful for that.
>>>>
>>>> I suppose predicated vs. non-predicated SVE is also potentially
>>>> interesting for an unrolled epilogue, although there, it would in
>>>> theory be better to predicate only the last vector iteration
>>>> (i.e. part predicated, part unpredicated).
>>>
>>> Yes, the latter is what we want for AVX512, keep the main loop
>>> not predicated but have the epilog predicated (using the same VF).
>>
>> Reading it back, what I said was very ambiguous (as usual, unfortunately).
>> What I actually meant was that if we had, say, a 4x unrolled main loop
>> and a 2x unrolled first epilogue loop, we'd in theory want the 2x
>> unrolled epilogue loop to use unpredicated operations for the first
>> VF/2 elements and predicated operations for the second VF/2 elements.
>>
>> That way, we get the benefit of the 2x unrolling for residues of >VF
>> elements, but skip to a second epilogue if there are VF or fewer
>> remaining elements.
>>
>> That example assumes that the last quarter of each iteration of the
>> main loop is predicated in a similar way, with the rest of the iteration
>> being unpredicated.
>
> Yes, so this would work by requesting a fixed-size VF/2 first epilog
> and a VF/2 fixed-size but masked second epilog. As you have distinct
> modes for masked/non-masked this should already work by means of the
> m_suggested_epilogue_mode field in the costs the target can set.

That sounds a bit different from what I was expecting though, in that
the predicated VF/2 portion would come after the unpredicated VF/2
version, rather than be interleaved with it. In:

>> Alternatively, we could have a fully-unpredicated 2x unrolled main
>> loop followed by the same kind of semi-predicated 2x unrolled
>> epilogue loop.
>>
>> So if U == unpredicated and P == predicated:
>>
>>   main loop: U U U P
>>   1st epilogue loop: U P
>>   2nd epilogue loop: P
>>
>>   1st and 2nd epilogues might both be used
>>
>> or:
>>
>>   main loop: U U
>>   1st epilogue loop: U P
>>   2nd epilogue loop: P
>>
>>   1st and 2nd epilogues are mutually exclusive

...the idea really would be to have a single 2x unrolled epilogue,
interleaved in the normal way. The first instruction in each pair
would be unpredicated and the second instruction would be predicated.
The main loop in the second example would behave similarly.

(To be clear, this isn't an objection to the patch. I'm just trying
to describe the use case.)

Thanks,
Richard
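
For illustration only, the second scheme quoted above (main loop U U,
1st epilogue U P, 2nd epilogue P) could be modelled in scalar C++ as
below. This is a rough sketch under assumptions, not vectorizer code:
kLanes, op_unpredicated, op_predicated and add_one are hypothetical
names, each "vector operation" is emulated by a scalar loop over
kLanes elements, and the operation itself (adding 1) is arbitrary.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: one emulated vector operation covers kLanes elements.
constexpr std::size_t kLanes = 4;

// Unpredicated operation (U): every lane is active.
static void op_unpredicated(const int *src, int *dst) {
  for (std::size_t lane = 0; lane < kLanes; ++lane)
    dst[lane] = src[lane] + 1;
}

// Predicated operation (P): only the first `active` lanes are active,
// so inactive lanes are neither read nor written.
static void op_predicated(const int *src, int *dst, std::size_t active) {
  for (std::size_t lane = 0; lane < kLanes; ++lane)
    if (lane < active)
      dst[lane] = src[lane] + 1;
}

// Adds 1 to every element of src and returns a trace of the operations
// issued, with '|' closing each main-loop iteration.
std::string add_one(const std::vector<int> &src, std::vector<int> &dst) {
  std::string trace;
  const std::size_t n = src.size();
  std::size_t i = 0;
  dst.assign(n, 0);

  // Main loop: 2x unrolled, fully unpredicated (U U).
  for (; n - i >= 2 * kLanes; i += 2 * kLanes) {
    op_unpredicated(src.data() + i, dst.data() + i);
    op_unpredicated(src.data() + i + kLanes, dst.data() + i + kLanes);
    trace += "UU|";
  }

  const std::size_t rem = n - i;
  if (rem > kLanes) {
    // 1st epilogue: one interleaved pair, U then P.
    op_unpredicated(src.data() + i, dst.data() + i);
    op_predicated(src.data() + i + kLanes, dst.data() + i + kLanes,
                  rem - kLanes);
    trace += "UP";
  } else if (rem > 0) {
    // 2nd epilogue: a single predicated operation (P).
    op_predicated(src.data() + i, dst.data() + i, rem);
    trace += "P";
  }
  return trace;
}
```

With kLanes == 4, 14 elements go through one main-loop iteration and
then the 1st epilogue, whereas 11 elements go through one main-loop
iteration and then the 2nd epilogue. At most one of the two epilogues
runs, which matches the "mutually exclusive" case.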