https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117606
            Bug ID: 117606
           Summary: single element interleaving behavior for SLP does not
                    exactly match non-SLP behavior
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

For the following, derived from gcc.target/aarch64/sve/strided_load_5.c

void foo (unsigned *restrict dest, unsigned *src, int n)
{
  for (int i = 0; i < n; ++i)
    dest[i] += src[i * 5];
}

when not using SLP we classify the load from src[] as VMAT_ELEMENTWISE and
thus consider using gathers.  This is because we do not have general permute
support there but only load-lanes and the static set of "group loads", and
both vect_load_lanes_supported and vect_grouped_load_supported fail on
aarch64 (riscv supports ld5).

With SLP this gets classified as (permuted) VMAT_CONTIGUOUS since load-lanes
isn't supported.  We do not attempt to lower the permutation to something
supported and thus run into the 3 vector limit.  The fallback to
VMAT_ELEMENTWISE is only for groups larger than the vector size, so it
doesn't apply here.

One possible approach would be to check for the permute to be supported
during VMAT classification rather than after the fact in
vectorizable_load/store.  That matches how we verify
vect_grouped_load_supported.
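
To make the permute problem concrete, here is a small standalone C sketch. It is not GCC code: foo_ref and input_vectors_needed are hypothetical helper names, and the fixed lane count is an assumption (SVE vectors are length-agnostic, so 4 lanes just models a 128-bit vector of 32-bit elements). It counts how many contiguous input vectors one output vector has to draw from when the load is treated as contiguous plus a permute.

```c
#include <assert.h>

/* Scalar reference for the loop in the report (same semantics as foo). */
void foo_ref(unsigned *restrict dest, const unsigned *src, int n)
{
  for (int i = 0; i < n; ++i)
    dest[i] += src[i * 5];
}

/* Hypothetical helper, not part of GCC: number of distinct contiguous
   input vectors of `lanes` elements that are touched when one output
   vector selects src[i * stride] for i in [0, lanes).  */
unsigned input_vectors_needed(unsigned lanes, unsigned stride)
{
  assert(lanes > 0 && stride > 0);

  unsigned count = 0;
  unsigned prev = (unsigned) -1;   /* no input vector seen yet */

  for (unsigned i = 0; i < lanes; ++i)
    {
      /* Index of the contiguous input vector holding element i * stride.
         The indices increase with i, so comparing against the previous
         one is enough to count distinct vectors.  */
      unsigned v = (i * stride) / lanes;
      if (v != prev)
        {
          ++count;
          prev = v;
        }
    }
  return count;
}
```

Under these assumptions, input_vectors_needed (4, 5) returns 4: lanes 0, 5, 10 and 15 each live in a different input vector, so a single output vector would need a permute over more inputs than the limit mentioned above allows.  With a stride of 2 the same helper returns 2.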