https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 36951
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36951&action=edit
patch for testing

Can ARM people please evaluate the attached?  It simply prefers
load/store-lane over SLP.  I'd like to know whether there are cases where
this is undesirable and whether this patch causes some loops not to be
vectorized at all (because I got the load/store-lane supported test wrong).

Caveats may be that SLP may require no unrolling while load/store-lane always
does, and thus with a statically known loop trip count the vectorization
would not be done with load/store-lanes.  Likewise the minimum required
iterations for the not-known case may cause the vectorized variant to always
be skipped if the loop trip count is small in practice.  Likewise the extra
peeling required for gaps may have the same effect (though with gaps the SLP
variant will always require possibly expensive permutes).

Thus caveats may apply mainly for low loop iteration counts (only decidable
at runtime in most cases).

The patch is a heuristic; possible improvements include looking at a
statically known loop trip count as well as at the actual permutation
required for SLP (may be none).  In the context of ARM load/store-lane I know
nothing about costs.

Eventually we should do the same for cases that regular interleaving can
handle if SLP requires permutations.
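
For illustration only (this loop is not taken from the PR or from the attached patch), the kind of grouped access the heuristic is about could look like the sketch below. The function name swap_pairs is hypothetical. Each iteration touches a group of two adjacent ints and swaps them, so an SLP vectorization would need a load permutation, whereas a load/store-lane vectorization (e.g. NEON vld2/vst2) could in principle de-interleave the two lanes into separate vectors and store them back in the opposite order.

```c
#include <stddef.h>

/* Hypothetical example: interleaved access with group size 2.
   dst and src are assumed not to overlap (hence restrict).  */
void
swap_pairs (int *restrict dst, const int *restrict src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      /* The two elements of each pair trade places, which is the
         permutation SLP would have to materialize explicitly.  */
      dst[2 * i] = src[2 * i + 1];
      dst[2 * i + 1] = src[2 * i];
    }
}
```

Which strategy GCC actually picks for such a loop depends on target, options and compiler version; comparing the -fdump-tree-vect-details dumps with and without the patch would be one way to check. With a small n that is only known at runtime, the caveats above about minimum iteration counts would apply.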