Yeah, VMAT_STRIDED_SLP is to SLP what VMAT_ELEMENTWISE was to non-SLP,
though how we emit the contiguous part of the SLP group varies, and it could
be elementwise as a fallback.


For the single-element case (and only for that one AFAICT) we can switch to
VMAT_GATHER_SCATTER.  Is the idea to relax that and also allow "strided"
gather/scatter for larger groups, involving composition types in particular?
Or maybe I missed the point.

So the idea would be, for the loop example above where IIRC pix1/pix2 are
char, to emit a gather to VnSI, combining four consecutive QImode loads into
one SImode element, and then view-convert the result back to a VmQImode
vector.  That also simplifies the offset vector calculation.  The "fallback"
(consider a non-power-of-two group size) would of course be to gather the
VmQImode vector directly, with an offset vector of
{ 0, 1, 2, 3, stride, stride+1, stride+2, stride+3, ... }.
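
To make the two variants concrete, here is a minimal scalar sketch in C.  It
is not vectorizer code: the function names, the group size of 4 and the four
groups per vector are assumptions chosen for illustration only.  It emulates
the direct byte gather with the offset vector quoted above, and the composed
variant that gathers one 32-bit element per group and then reinterprets the
result as bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sizes (assumptions): 4 chars per group, 4 groups per vector.  */
enum { GROUP = 4, NGROUPS = 4, NELTS = GROUP * NGROUPS };

/* Direct variant: byte offsets { 0, 1, 2, 3, stride, stride+1, ... }.  */
static void
build_qi_offsets (size_t stride, size_t offsets[NELTS])
{
  for (size_t i = 0; i < NELTS; i++)
    offsets[i] = (i / GROUP) * stride + i % GROUP;
}

/* Emulated gather of NELTS single bytes (the "VmQImode" fallback).  */
static void
gather_qi (const uint8_t *base, const size_t offsets[NELTS],
           uint8_t out[NELTS])
{
  for (size_t i = 0; i < NELTS; i++)
    out[i] = base[offsets[i]];
}

/* Composed variant: gather NGROUPS 32-bit elements using the simpler
   offsets { 0, stride, 2*stride, ... }, then reinterpret them as bytes.  */
static void
gather_si_view_qi (const uint8_t *base, size_t stride, uint8_t out[NELTS])
{
  uint32_t lanes[NGROUPS];
  assert (sizeof lanes == NELTS);
  for (size_t g = 0; g < NGROUPS; g++)
    /* memcpy keeps the possibly unaligned 4-byte load well defined.  */
    memcpy (&lanes[g], base + g * stride, sizeof lanes[g]);
  /* The "view-convert": same bits, now seen as NELTS bytes.  */
  memcpy (out, lanes, NELTS);
}
```

Both functions fill `out` with the same bytes for any stride, provided the
buffer covers (NGROUPS - 1) * stride + GROUP bytes; the composed variant just
needs a quarter of the offsets.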

OK, good, then we're aligned, because that's mostly what I already did (though obviously inside VMAT_STRIDED_SLP and just for strided loads, not more generically as we want here).

I think refactoring how we do VMAT_STRIDED_SLP vs. VMAT_GATHER_SCATTER,
at least (and possibly specifically) for the case of emulated handling,
might be a good thing.  But it'll require experiments to see how it all
fits together.

Yes, I'll play around with some re-ordering and fallbacks. My main concern is choosing a more expensive load (a gather) and then not being able to fall back to a lighter-weight vector-vector composition scheme. But maybe we can probe in advance whether such a scheme is available and what the vector-size trade-offs are.

My current priority is to sort out the analysis-vs-transform "split" and storing
more data from analysis into the SLP node for both load/store and reductions
so that data is also more easily accessible from the cost models.

One thing I remember wanting to fix is adjusting the vector size used for costing after view-converting a vector (which we're already doing in VMAT_STRIDED_SLP). For our microarchitecture the number of elements is crucial in costing a gather, scatter, or strided access.
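
As a rough illustration of why the element count matters, here is a toy C sketch.  The helper names and the linear per-element cost are purely hypothetical and are not GCC's cost hooks; the point is only that the same 16-byte vector counts as 16 elements when gathered as QImode but as 4 elements once it is view-converted to SImode.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements a gather touches for a vector of VECTOR_BYTES bytes
   whose gathered elements are ELT_BYTES wide.  */
static size_t
gather_nelts (size_t vector_bytes, size_t elt_bytes)
{
  assert (elt_bytes != 0 && vector_bytes % elt_bytes == 0);
  return vector_bytes / elt_bytes;
}

/* Hypothetical toy cost model: cost grows linearly with the element count.  */
static unsigned
toy_gather_cost (size_t nelts, unsigned cost_per_elt)
{
  return (unsigned) nelts * cost_per_elt;
}
```

Under this toy model, costing with the pre-conversion QImode element count would make the gather look four times as expensive as costing with the SImode element count.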

--
Regards
Robin
