On Fri, Jul 25, 2025 at 10:32 PM Robin Dapp <rdapp....@gmail.com> wrote:
>
> > That would definitely be nice to have for both gather and stride loads
>
> I'm not sure I like the direction that's heading ;)
>
> So the loop I'm targeting is x264's satd:
>
>     for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
>     {
>         a0 = (pix1[0] - pix2[0])...
>         a1 = (pix1[1] - pix2[1])...
>         a2 = (pix1[2] - pix2[2])...
>         a3 = (pix1[3] - pix2[3])...
>
> where DR_STEP is known but non-constant, so STMT_VINFO_STRIDED_P = true.
>
> Right now we always set VMAT_STRIDED_SLP when STMT_VINFO_STRIDED_P.
Yeah, VMAT_STRIDED_SLP is what VMAT_ELEMENTWISE was to non-SLP, though
how we emit the contiguous part of the SLP group depends and it could
be elementwise as fallback.

>
> For the single-element case (and only for that one AFAICT) we can switch to
> VMAT_GATHER_SCATTER.  Is the idea to relax that and also allow "strided"
> gather/scatter for larger groups, involving composition types in particular?
> Or maybe I missed the point.

So the idea would be to, for the loop example above where IIRC
pix1/pix2 are char, emit a gather to VnSI, combining four consecutive
QImode loads into one SImode element, and then view-convert the result
back to a VmQImode vector.  That also simplifies the offset vector
calculation.  The "fallback" (consider non-power-of-two group size)
would of course be to gather the VmQImode vector directly, and have
an offset vector of
{ 0, 1, 2, 3, stride, stride+1, stride+2, stride+3, ... }

> One complication with that is that generic gather/scatter on riscv is also
> pretty slow, not sure if it's as bad as on x86 but certainly only rarely a
> win.
>
> At least right now I'm having a hard time imagining which strategy will be
> faster and I'd be more comfortable with a costing decision rather than a
> static switch.  And we don't compare costs for different strategies but just
> choose one for a specific mode.  Of course, in the end VMAT_STRIDED_SLP
> usually performs scalar loads in order to construct a vector but
> vector-vector loads and construction is also possible.  Maybe that's better
> than gather/scatter/strided.  I would need to compare a few cases for real
> to get a better feeling of it.

I think refactoring how we do VMAT_STRIDED_SLP vs. VMAT_GATHER_SCATTER,
at least and possibly specifically for the case of emulated handling,
might be a good thing.  But it'll require experiments to see how it all
fits together.
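To make the two gather shapes above concrete, here is a scalar C model of
which bytes end up in which lanes.  This is only a sketch: the function
names, GROUP and the signatures are made up for illustration and are not
vectorizer code, and it says nothing about how (or whether) a target
would emit either form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical SLP group size: pix[0..3] are accessed per iteration.  */
enum { GROUP = 4 };

/* Variant 1: gather one 32-bit (SImode-like) element per row, i.e. the
   four consecutive bytes combined, then reinterpret the result as bytes.
   The offset vector is simply { 0, stride, 2*stride, ... }.  */
static void
gather_as_si (const uint8_t *base, ptrdiff_t stride, size_t rows,
              uint8_t *out)
{
  assert (base != NULL && out != NULL);
  for (size_t r = 0; r < rows; r++)
    {
      uint32_t elt;
      /* One 4-byte element at byte offset r * stride (may be unaligned,
         hence memcpy).  */
      memcpy (&elt, base + (ptrdiff_t) r * stride, sizeof elt);
      /* "View-convert" back to four byte (QImode-like) lanes.  */
      memcpy (out + r * GROUP, &elt, sizeof elt);
    }
}

/* Variant 2 (fallback): gather bytes directly with the offset vector
   { 0, 1, 2, 3, stride, stride+1, stride+2, stride+3, ... }.  */
static void
gather_as_qi (const uint8_t *base, ptrdiff_t stride, size_t rows,
              uint8_t *out)
{
  assert (base != NULL && out != NULL);
  for (size_t i = 0; i < rows * GROUP; i++)
    {
      ptrdiff_t off = (ptrdiff_t) (i / GROUP) * stride
                      + (ptrdiff_t) (i % GROUP);
      out[i] = base[off];
    }
}
```

Both variants produce the same byte vector.  The first needs a quarter
of the gather elements and a trivial offset vector, but only works when
the group size matches an integer mode; the second handles any group
size at the cost of one gather element per byte.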
My current priority is to sort out the analysis-vs-transform "split" and
storing more data from analysis into the SLP node for both load/store
and reductions so that data is also more easily accessible from the
cost models.

Richard.

>
> If I didn't miss the point I could give it a shot.  Maybe my complicated
> dynamic-dispatch scheme for groups larger than the largest vector unit would
> fit in there as well then (PR120639).
>
> --
> Regards
>  Robin
>