On Fri, Jul 25, 2025 at 10:32 PM Robin Dapp <rdapp....@gmail.com> wrote:
>
> > That would definitely be nice to have for both gather and stride loads
>
> I'm not sure I like the direction that's heading ;)
>
> So the loop I'm targeting is x264's satd:
>
>     for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
>     {
>         a0 = (pix1[0] - pix2[0])...
>         a1 = (pix1[1] - pix2[1])...
>         a2 = (pix1[2] - pix2[2])...
>         a3 = (pix1[3] - pix2[3])...
>
> where DR_STEP is known but non-constant, so STMT_VINFO_STRIDED_P = true.
>
> Right now we always set VMAT_STRIDED_SLP when STMT_VINFO_STRIDED_P.

Yeah, VMAT_STRIDED_SLP is to SLP what VMAT_ELEMENTWISE was to non-SLP,
though how we emit the contiguous part of the SLP group varies, and it can
be elementwise as a fallback.

>
> For the single-element case (and only for that one AFAICT) we can switch to
> VMAT_GATHER_SCATTER.  Is the idea to relax that and also allow "strided"
> gather/scatter for larger groups, involving composition types in particular?
> Or maybe I missed the point.

So the idea would be, for the loop example above where IIRC pix1/pix2 are
char, to emit a gather to VnSI, combining four consecutive QImode loads into
one SImode element, and then view-convert the result back to a VmQImode
vector.  That also simplifies the offset vector calculation.  The "fallback"
(consider a non-power-of-two group size) would of course be to gather the
VmQImode vector directly, with an offset vector of
{ 0, 1, 2, 3, stride, stride+1, stride+2, stride+3, ... }.
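
To make the two variants concrete, here is a scalar C sketch of what each
gather would compute.  It is only an illustration, not vectorizer code: the
names GROUP, ROWS, gather_as_si, gather_as_qi, build_offsets and
strategies_agree are made up for this example, and the group size of four
bytes and the stride of 7 are assumptions taken from or chosen for the satd
loop above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes: a group of 4 chars per row, 4 rows per vector.  */
enum { GROUP = 4, ROWS = 4, LANES = GROUP * ROWS };

/* Variant 1: gather one 32-bit element per row (offsets 0, stride,
   2*stride, ...) and reinterpret the result as bytes.  */
static void gather_as_si(const uint8_t *base, ptrdiff_t stride,
                         uint8_t out[LANES])
{
    uint32_t wide[ROWS];
    for (int row = 0; row < ROWS; row++)
        memcpy(&wide[row], base + row * stride, sizeof wide[row]);
    memcpy(out, wide, LANES);   /* stands in for the view-convert */
}

/* Variant 2 (fallback): byte offsets
   { 0, 1, 2, 3, stride, stride+1, stride+2, stride+3, ... }.  */
static void build_offsets(ptrdiff_t stride, ptrdiff_t offsets[LANES])
{
    for (int lane = 0; lane < LANES; lane++)
        offsets[lane] = (lane / GROUP) * stride + lane % GROUP;
}

static void gather_as_qi(const uint8_t *base, ptrdiff_t stride,
                         uint8_t out[LANES])
{
    ptrdiff_t offsets[LANES];
    build_offsets(stride, offsets);
    for (int lane = 0; lane < LANES; lane++)
        out[lane] = base[offsets[lane]];
}

/* Returns 1 if both variants load the same bytes for a sample buffer.  */
int strategies_agree(void)
{
    enum { STRIDE = 7 };        /* arbitrary, deliberately not a multiple of 4 */
    uint8_t image[ROWS * STRIDE];
    uint8_t a[LANES], b[LANES];

    assert((ROWS - 1) * STRIDE + GROUP <= (int) sizeof image);
    for (size_t i = 0; i < sizeof image; i++)
        image[i] = (uint8_t) (3 * i + 1);

    gather_as_si(image, STRIDE, a);
    gather_as_qi(image, STRIDE, b);
    return memcmp(a, b, LANES) == 0;
}
```

The first variant needs only ROWS offsets, which is where the simpler offset
calculation comes from; the second needs one offset per byte lane but works
for any group size.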

> One complication with that is that generic gather/scatter on riscv is also
> pretty slow, not sure if it's as bad as on x86 but certainly only rarely a
> win.
>
> At least right now I'm having a hard time imagining which strategy will be
> faster and I'd be more comfortable with a costing decision rather than a
> static switch.  And we don't compare costs for different strategies but just
> choose one for a specific mode.  Of course, in the end VMAT_STRIDED_SLP
> usually performs scalar loads in order to construct a vector but
> vector-vector loads and construction is also possible.  Maybe that's better
> than gather/scatter/strided.  I would need to compare a few cases for real
> to get a better feeling of it.

I think refactoring how we do VMAT_STRIDED_SLP vs. VMAT_GATHER_SCATTER, at
least and possibly specifically for the case of emulated handling, would be
a good thing.  But it'll require experiments to see how it all fits together.

My current priority is to sort out the analysis-vs-transform "split" and to
store more data from analysis in the SLP node for both loads/stores and
reductions, so that the data is also more easily accessible from the cost
models.

Richard.

>
> If I didn't miss the point I could give it a shot.  Maybe my complicated
> dynamic-dispatch scheme for groups larger than the largest vector unit would
> fit in there as well then (PR120639).
>
> --
> Regards
>  Robin
>
