Richard Biener <rguent...@suse.de> writes:
> On Tue, 27 Oct 2020, Richard Sandiford wrote:
>
>> Sorry for the very late comment (was out last week).
>>
>> Richard Biener <rguent...@suse.de> writes:
>> > This enables SLP store group splitting also for loop vectorization.
>> > For the existing testcase gcc.dg/vect/vect-complex-5.c this then
>> > generates much better code, likewise for the PR97428 testcase.
>> >
>> > Both of those have a splitting opportunity splitting the group
>> > into two equal (vector-sized) halves, still the patch enables
>> > quite arbitrary splitting since generally the interleaving scheme
>> > results in quite awkward code for even small groups.  If any
>> > problems surface with this it's easy to restrict the splitting
>> > to known-good cases.  Are there any additional constraints for
>> > non-constant sized vectors?  Note this interacts with vector
>> > size iteration (but comparing interleaving cost with SLP cost
>> > of a smaller vector size doesn't reliably pick the smaller
>> > vector size).
>>
>> Not sure about the variable-sized vector aspect.  For SVE it
>> isn't really natural to split the store itself up: I think we'd
>> instead want to keep a unified store and blend in the stored
>> values where necessary.  E.g. rather than split:
>>
>>   a a a a b b c c
>>
>> into:
>>
>>   a a a a
>>   b b
>>   c c
>>
>> we'd be better off having predicated groups of the form:
>>
>>   a a a a _ _ _ _
>>   _ _ _ _ b b _ _
>>   _ _ _ _ _ _ c c
>>
>> This is one thing on the very long todo list :-/
>
> Hmm, I see.  Looking at the case of a group_size == 3 store
> right now which (for the sake of register pressure) would
> benefit from V4xy vectorization and a masked store, doing
> sth "smart" to fill up lane 4 (duplicating another one
> would always work but possibly make loads more expensive,
> masking would work here as well).

Yeah.  Also, SVE has an instruction that fills up a predicate up to
the largest multiple of 3.  So for a group size of 3 we could do
something like:

    ptrue   p0.b, mul3
    ld1b    z0.b, p0/z, ...
    ...
    st1b    z0.b, p0, ...

For the final (possibly partial) iteration we'd just use WHILELO as
normal, knowing that nscalars * 3 fits into a vector.

Yet another thing on the to-do list :-)

Thanks,
Richard
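
For illustration only, the group-size-3 scheme above can be modelled in
scalar Python.  This is a rough sketch, not GCC code and not what the
vectorizer does today: ptrue_mul3, whilelo and copy_groups_of_3 are
hypothetical helper names, the lane count is passed in as a plain
integer, and the ld1b/st1b pair is reduced to an element-wise copy.

```python
def ptrue_mul3(nlanes):
    """Model PTRUE with the MUL3 pattern: the active lanes are the
    largest multiple of 3 that fits in a vector of nlanes lanes."""
    active = (nlanes // 3) * 3
    return [lane < active for lane in range(nlanes)]


def whilelo(start, limit, nlanes):
    """Model WHILELO: lane i is active while start + i < limit."""
    return [start + lane < limit for lane in range(nlanes)]


def copy_groups_of_3(src, nlanes):
    """Copy src, a sequence of 3-element groups, with one predicated
    load/store pair per iteration (assumed loop shape, for illustration)."""
    assert len(src) % 3 == 0, "src must hold whole groups of 3"
    assert nlanes >= 3, "a vector must hold at least one group"
    dst = [None] * len(src)

    full = ptrue_mul3(nlanes)
    step = sum(full)  # elements handled by one full iteration

    pos = 0
    # Full iterations: every lane enabled by the MUL3 predicate is copied.
    while len(src) - pos >= step:
        for lane, on in enumerate(full):
            if on:
                dst[pos + lane] = src[pos + lane]
        pos += step

    # Final, possibly partial iteration: fewer than `step` elements remain,
    # so the remaining groups are known to fit into one vector.
    tail = whilelo(pos, len(src), nlanes)
    for lane, on in enumerate(tail):
        if on:
            dst[pos + lane] = src[pos + lane]
    return dst
```

With 16 byte lanes the MUL3 predicate leaves 15 lanes active, so 21
elements (7 groups) take one full iteration plus a 6-lane WHILELO tail.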