On Wed, Sep 17, 2025 at 9:22 AM Robin Dapp <[email protected]> wrote:
>
> > We are supposed to not get into
> >
> >   if (mask_element != index)
> >     noop_p = false;
>
> I guess the problem is the vectype mismatch. We're checking the permutation
> for e.g. V16QI = {0, 1, 2, 3, 8, 9, 10, 11, ...} which, in isolation, is not
> a nop. That's because nelts_to_build = vf * group_size = 16.
>
> So either we need to check monotonicity etc. for each punned element later or
> we somehow need to pun earlier (as you suggested yesterday).
I don't think that would help - the issue is that the group_size is 8 but
the elements 4, 5, 6, 7 are gaps that we simply do not load.  That is, the
permute code does not anticipate that we turned the contiguous load into a
strided one where we do not load the trailing gap, so we effectively have
group_size == 4?  In other words, it's dr_group_size that is "wrong" if we
want to apply the load-permutation after our way of gathering the
to-be-permuted elements, since we are not building vectors that have those
gaps represented - they are skipped.
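
To make that concrete, here's a tiny standalone illustration - not actual
vectorizer code, just mimicking roughly what the mask computation in
vect_transform_slp_perm_load boils down to for the V16QI example above
(4 loaded lanes, DR group size 8, trailing gap 4):

  #include <cstdio>

  int main ()
  {
    const unsigned perm[4] = { 0, 1, 2, 3 };      /* lanes we actually load */
    const unsigned group_size = 4;                /* lanes in the SLP node */
    const unsigned nelts_to_build = 16;           /* vf * group_size */
    const unsigned dr_group_sizes[2] = { 8, 4 };  /* with / without the gap */
    for (unsigned dr_group_size : dr_group_sizes)
      {
        bool noop_p = true;
        for (unsigned j = 0; j < nelts_to_build; j++)
          {
            unsigned mask_element
              = (j / group_size) * dr_group_size + perm[j % group_size];
            if (mask_element != j)
              noop_p = false;
          }
        std::printf ("dr_group_size %u: noop_p = %d\n", dr_group_size, noop_p);
      }
    return 0;
  }

With dr_group_size 8 the mask is {0, 1, 2, 3, 8, 9, 10, 11, ...} and noop_p
ends up false; with the gap subtracted (effective size 4) it is the identity.
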
Of course this means the early vect_transform_slp_perm_load call computing
n_perms cannot anticipate whether we are "re-interpreting" the DR group as
strided. It also means we cannot simply perform a permutation using this
function without adjusting it accordingly. But this means we're not actually
repeating_p right now, correct?
One could add a gap_skipped parameter to the function and adjust

  dr_group_size = DR_GROUP_SIZE (stmt_info);

to

  dr_group_size = DR_GROUP_SIZE (stmt_info)
                  - (gap_skipped ? DR_GROUP_GAP (stmt_info) : 0);
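
That would then feed into the element index computation roughly like this
(paraphrased from memory, not the exact current code):

  unsigned dr_group_size = DR_GROUP_SIZE (stmt_info);
  if (gap_skipped)
    /* The trailing gap is not loaded, so index into the group as if it
       were not there.  */
    dr_group_size -= DR_GROUP_GAP (stmt_info);
  ...
  /* Element j of the to-be-built vector then maps to group element
     (as in the illustration above).  */
  unsigned i = (j / group_size) * dr_group_size + perm[j % group_size];
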
But we should try to compute this only once (and then transform consistently,
or make sure we never need to do it with gap_skipped at all), meaning we have
to re-order
  /* For single-element interleaving also fall back to elementwise
     access in case we did not lower a permutation and cannot
     code generate it. */
  if (loop_vinfo
      && single_element_p
      && SLP_TREE_LANES (slp_node) == 1
      && (*memory_access_type == VMAT_CONTIGUOUS
          || *memory_access_type == VMAT_CONTIGUOUS_REVERSE)
      && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
      && !perm_ok)
    {
      *memory_access_type = VMAT_ELEMENTWISE;
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "single-element interleaving permutation not "
                         "supported, using elementwise access\n");
    }
and the "last resort"
  /* As a last resort, trying using a gather load or scatter store.
     ??? Although the code can handle all group sizes correctly,
     it probably isn't a win to use separate strided accesses based
     on nearby locations. Or, even if it's a win over scalar code,
     it might not be a win over vectorizing at a lower VF, if that
     allows us to use contiguous accesses. */
  if (loop_vinfo
      && (*memory_access_type == VMAT_ELEMENTWISE
          || *memory_access_type == VMAT_STRIDED_SLP)
      && !STMT_VINFO_GATHER_SCATTER_P (stmt_info)
      && SLP_TREE_LANES (slp_node) == 1
      && (!SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
          || single_element_p))
    {
      gather_scatter_info gs_info;
      if (vect_use_strided_gather_scatters_p (stmt_info, vectype, loop_vinfo,
                                              masked_p, &gs_info, elsvals,
                                              group_size, single_element_p))
        {
          SLP_TREE_GS_SCALE (slp_node) = gs_info.scale;
          SLP_TREE_GS_BASE (slp_node) = error_mark_node;
          ls->gs.ifn = gs_info.ifn;
          ls->strided_offset_vectype = gs_info.offset_vectype;
          *memory_access_type = VMAT_GATHER_SCATTER_IFN;
        }
    }
or better, try to make a "unified" decision here?  As a last resort it would
work to check, instead of SLP_TREE_LOAD_PERMUTATION (slp_node).exists (),
whether the permutation with the gap skipped would not require any permute
at all, or whether we can support such a permute and implement it after
gathering the vector(s).
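
The first half of that check could look something like the following
(untested sketch, the helper name is made up; it only covers the "no permute
needed once the gap is skipped" case):

  /* Return true if the load permutation of NODE selects exactly the
     leading, actually loaded lanes of the DR group in order, i.e. it
     becomes a no-op once the trailing gap is skipped.  */
  static bool
  load_perm_noop_with_gap_skipped_p (slp_tree node, stmt_vec_info stmt_info)
  {
    if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
      return true;
    unsigned loaded_lanes
      = DR_GROUP_SIZE (stmt_info) - DR_GROUP_GAP (stmt_info);
    unsigned i, elem;
    FOR_EACH_VEC_ELT (SLP_TREE_LOAD_PERMUTATION (node), i, elem)
      if (elem != i || elem >= loaded_lanes)
        return false;
    return true;
  }

The "we can support such a permute" half would then presumably still go
through vect_transform_slp_perm_load with the gap_skipped adjustment
sketched above.
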
Richard.
> --
> Regards
> Robin
>