On Tue, Sep 16, 2025 at 3:07 PM Robin Dapp <[email protected]> wrote:
>
> > I think this now conflicts a bit with what I just pushed (sorry).
> >
> >>       && loop_vinfo)
> >>     {
> >> +      unsigned i, j;
> >> +      bool simple_perm_series = true;
> >> +      FOR_EACH_VEC_ELT (SLP_TREE_LOAD_PERMUTATION (slp_node), i, j)
> >> +        if (i != j)
> >> +          simple_perm_series = false;
> >
> > In particular I disallow all load permutes, since we are going to elide it.
>
> The load permutation in the x264 case is simple, {0, 1, 2, 3}, {4, 5, 6, 7},
> etc. If we're disallowing these as well here there is no point in having the
> new function ;)
>
> So we're not allowing load permutations here because we would elide (=not
> generate code for) them via
> >   /* ??? The following checks should really be part of
> >      get_load_store_type.  */
> >   if (SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
> even though perm_ok = true?
Yes.
> Isn't this all single-element, single-lane anyway?
Well, what you want to catch now isn't single-lane anymore. But I guess since
we now check the permute before this we can rely on the n_perms == 0 check
to catch the "no actual permutation required" case?
> Can I re-allow simple permutations (as they are nops anyway)? Or rather add a
> vect_transform_slp_perm_load in the mat_gather_scatter_p part of
> vectorizable_load?
So can you pass down n_perms and check that instead of your loop over
the permutation?
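Something like the following (untested sketch - I'm assuming the
analyze-only mode of vect_transform_slp_perm_load with its n_perms
out-parameter, which IIRC we already use during analysis):

  /* Untested sketch: let the permute analysis tell us whether any
     permute statement would actually be emitted, instead of
     open-coding the identity check.  */
  unsigned n_perms;
  bool noop_perm
    = (vect_transform_slp_perm_load (vinfo, slp_node, vNULL, NULL, vf,
                                     true, &n_perms)
       && n_perms == 0);

n_perms == 0 then covers exactly the {0, 1, 2, 3}, {4, 5, 6, 7}, ... case.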
> The case I was trying to avoid (for now) when disabling load permutations
> is
>
> for (int i = 0; i < 512; ++i)
>   {
>     x[2*i] = y[1023 - (2*i)];
>     x[2*i+1] = y[1023 - (2*i+1)];
>   }
>
> (load permutation {1, 0}).
Yeah, that makes sense, given we do not apply permutations after gathering
the vector (because, with gathers, there was never any load permutation ...
until now).
> > Re-doing this after the above is a bit ugly. Likewise having pun_vectype.
> > It's also an opportunity to move the VMAT_STRIDED_SLP punning and
> > alignment (re-)computation code here. OTOH this is complicated enough
> > already.
>
> I can try but I'm not sure it will be any less complex.
>
> > I believe you now have to change
> >
> >   /* ??? The following checks should really be part of
> >      get_load_store_type.  */
> >   if (SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
> >       && !((memory_access_type == VMAT_ELEMENTWISE
> >             || mat_gather_scatter_p (memory_access_type))
> >            && SLP_TREE_LANES (slp_node) == 1
> >            && (!grouped_load
> >                || !DR_GROUP_NEXT_ELEMENT (first_stmt_info))))
> >     {
> >
> > as otherwise you'll get slp_perm set but the permute will be silently
> > elided. This is all a bit ugly already :/
>
> So we're setting slp_perm but know that we don't act on it in
> VMAT_ELEMENTWISE?
So for VMAT_ELEMENTWISE we _do_ apply a permutation (not sure that works
though, too many guard rails around VMAT_ELEMENTWISE). But when we do
not run into the above body we have slp_perm = false and skip applying a
permute. The above tries to catch the old strided load case and the
VMAT_ELEMENTWISE we fall back to when we cannot materialize the permute.
I don't like very much how this is spread over multiple places ...
Ideally VMAT_ELEMENTWISE would support all permutes by means of
directly honoring them during vector construction, so it shouldn't matter
whether vect_transform_slp_perm_load would succeed or fail. But that's
not done yet.
I think I'd want to split VMAT_ELEMENTWISE from VMAT_STRIDED_SLP for
this, but until I try I don't know whether I'll like that.
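Just to illustrate what I mean with honoring the permute during vector
construction - purely hypothetical, read_scalar_elem and
build_vector_from_elts do not exist in this form:

  /* Hypothetical sketch: build each vector lane directly from the
     group element the load permutation selects, instead of loading
     in group order and permuting afterwards.  */
  for (unsigned lane = 0; lane < const_nunits; ++lane)
    elts[lane] = read_scalar_elem (group_base, perm[lane] * dr_step);
  new_temp = build_vector_from_elts (vectype, elts);

With that, whether vect_transform_slp_perm_load can handle the permutation
wouldn't matter for VMAT_ELEMENTWISE at all.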
Given we now have gathers that can have a permutation - like the
following:
> >
> > Now I wonder if/how we handle
> >
> >   for (i = 0; i < n; ++i)
> >     {
> >       int j = offset[i];
> >       sum += data[2*j] + data[2*j+1];
> >     }
could be one such case, it probably makes sense to handle load-permutation
materialization after the gather operation. OTOH I'd like to get rid
of load-permutations, but ...
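I.e. roughly (again untested, and assuming we'd collect the gather results
into a dr_chain like the strided path does):

  /* Untested sketch: materialize the load permutation after the
     gather, the same way the other load paths do it.  */
  if (slp_perm)
    {
      unsigned n_perms;
      vect_transform_slp_perm_load (vinfo, slp_node, dr_chain, gsi, vf,
                                    false, &n_perms);
    }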
> > aka a gather grouped load. Maybe there's an opportunity to handle
> > the "punning" higher up, representing this as a single-element group
> > in the first place. Hmm.
> >
> > Anyway, I think the general direction of the patch is OK. You'll have to
> > figure what I just broke though and I'm still somewhat missing the
> > "beauty" moment when thinking of how the VMAT_STRIDED_SLP
> >
> >   /* ??? Modify local copies of alignment_support_scheme and
> >      misalignment, but this part of analysis should be done
> >      earlier and remembered, likewise the chosen load mode.  */
> >
> > parts resolve themselves into happiness with this ... ;)
>
> I'm not sure I'm capable of producing a beauty moment in the vectorizer ;)
Heh, I wasn't expecting you to produce beauty - it's not that I don't fail there
as well :/
> but I'll try how ugly it gets when consolidating the type punning.
Yep, I was just thinking that, as it's similar (and we've come from doing
the strided load in that code path to the gather one), we might be able to
come up with something re-usable.
Richard.
>
> --
> Regards
> Robin
>