Yeah, VMAT_STRIDED_SLP is what VMAT_ELEMENTWISE was to non-SLP,
though how we emit the contiguous part of the SLP group varies, and it could
be elementwise as a fallback.
For the single-element case (and only for that one AFAICT) we can switch to
VMAT_GATHER_SCATTER. Is the idea to relax that and also allow "strided"
gather/scatter for larger groups, involving composition types in particular?
Or maybe I missed the point.
So the idea would be, for the loop example above where IIRC pix1/pix2 are
char, to emit a gather to VnSI, combining four consecutive QImode loads into
one SImode element, and then view-convert the result back to a VmQImode
vector. That also simplifies the offset vector calculation. The "fallback"
(consider a non-power-of-two group size) would of course be to gather the
VmQImode vector directly, with an offset vector of
{ 0, 1, 2, 3, stride, stride+1, stride+2, stride+3, ... }.
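To make the two variants concrete, here is a minimal scalar sketch in C. It
is not vectorizer code; the names gather_composed, gather_bytes, GROUP and
NGROUPS are made up for illustration, and the group size of four chars is
taken from the example above. Both functions are meant to produce the same
char vector: one by gathering 32-bit elements and reinterpreting the bytes,
the other by gathering chars with the explicit offset vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sizes only: four groups of four chars each.  */
enum { GROUP = 4, NGROUPS = 4, NELTS = GROUP * NGROUPS };

/* Variant 1: treat each group of four chars as one 32-bit element,
   gather NGROUPS of them at byte offsets i * stride, then reinterpret
   the result as chars (the "view-convert" step).  */
static void
gather_composed (const unsigned char *base, size_t stride,
                 unsigned char out[NELTS])
{
  uint32_t wide[NGROUPS];
  for (size_t i = 0; i < NGROUPS; i++)
    memcpy (&wide[i], base + i * stride, sizeof wide[i]);
  /* Byte-wise copy, so the result does not depend on endianness.  */
  memcpy (out, wide, NELTS);
}

/* Variant 2 (fallback): gather the chars directly using the offset
   vector { 0, 1, 2, 3, stride, stride+1, stride+2, stride+3, ... }.  */
static void
gather_bytes (const unsigned char *base, size_t stride,
              unsigned char out[NELTS])
{
  size_t offsets[NELTS];
  for (size_t i = 0; i < NELTS; i++)
    offsets[i] = (i / GROUP) * stride + i % GROUP;
  for (size_t i = 0; i < NELTS; i++)
    out[i] = base[offsets[i]];
}
```

Note that variant 1 needs only NGROUPS offsets (i * stride), while variant 2
needs one offset per char; that is the offset simplification mentioned above.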
Ok good then we're aligned because that's mostly what I already did (though
obviously inside VMAT_STRIDED_SLP and just for strided load, not more
generically as we want here).
I think refactoring how we do VMAT_STRIDED_SLP vs VMAT_GATHER_SCATTER, at
least (and possibly specifically) for the case of emulated handling, might be
a good thing. But it'll require experiments to see how it all fits together.
Yes, I'll try to play around and try some re-ordering and fallbacks. My main
concern is choosing a more expensive load (gather) and not being able to fall
back to a lighter-weight vector-vector composition scheme. But maybe we can
probe in advance whether such a scheme is available and what the vector size
trade-offs etc. are.
My current priority is to sort out the analysis-vs-transform "split" and storing
more data from analysis into the SLP node for both load/store and reductions
so that data is also more easily accessible from the cost models.
One thing I remember I wanted to fix is adjusting the vector size for costing
after view-converting a vector (which we're already doing in
VMAT_STRIDED_SLP). For our microarchitecture the number of elements is crucial
in costing a gather/scatter/strided access.
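As a rough illustration of why the element count matters, here is a small C
sketch. The helpers access_nelts and access_cost, the linear per-element cost
and the numbers in the comments are all assumptions for illustration, not the
actual cost hooks or any real microarchitecture's figures.

```c
#include <assert.h>

/* Hypothetical helper: number of elements a gather/scatter/strided
   access operates on for a given vector size and element size.  */
static unsigned
access_nelts (unsigned vector_bytes, unsigned elt_bytes)
{
  return vector_bytes / elt_bytes;
}

/* Hypothetical cost model: cost grows linearly with the number of
   elements.  Real cost models are target-specific.  */
static unsigned
access_cost (unsigned nelts, unsigned per_elt_cost)
{
  return nelts * per_elt_cost;
}
```

With a 16-byte vector, costing with the original char type sees
access_nelts (16, 1) == 16 elements, whereas the access that is actually
emitted after view-converting to 4-byte elements touches only
access_nelts (16, 4) == 4. Under the linear assumption the cost would be
overestimated by a factor of four unless the adjusted type is used.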
--
Regards
Robin