https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116973
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note the load in question isn't lowered because of

          /* When the load permutation accesses a contiguous unpermuted,
             power-of-two aligned and sized chunk leave the load alone.
             We can likely (re-)load it more efficiently rather than
             extracting it from the larger load.
             ???  Long-term some of the lowering should move to where
             the vector types involved are fixed.  */
          if (ld_lanes_lanes == 0
              && contiguous
              && (SLP_TREE_LANES (load) > 1 || loads.size () == 1)
              && pow2p_hwi (SLP_TREE_LANES (load))
              && SLP_TREE_LOAD_PERMUTATION (load)[0] % SLP_TREE_LANES (load) == 0
              && group_lanes % SLP_TREE_LANES (load) == 0)
            {
              final_perm.release ();
              continue;
            }

which I added as part of r15-3442-g7164d982663738, the change that enables
lowering of single loads and specifically exempts some cases to avoid
regressions.  gcc.dg/vect/slp-gap-1.c is the testcase that benefits from
the above.

That case asks for lowering being more aware of gaps, I guess.  While we now
track those with a NULL scalar stmt in the lanes, the lowering code doesn't
track the "do-not-care" state of such lanes, but it probably should.
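
For readers who want to experiment with which permutations hit this exemption, the quoted condition can be sketched as a standalone predicate. This is an illustration only, not the vectorizer code: keeps_load_unlowered, is_contiguous and is_pow2 are hypothetical names, the lane count is assumed to equal the length of the load permutation, "contiguous" is assumed to mean that lane i reads element perm[0] + i, and is_pow2 is a stand-in for pow2p_hwi.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for GCC's pow2p_hwi (assumption): true when x is a power of two.
static bool is_pow2(std::size_t x) { return x != 0 && (x & (x - 1)) == 0; }

// Assumed meaning of `contiguous` in the quoted snippet:
// lane i of the permutation reads element perm[0] + i.
static bool is_contiguous(const std::vector<unsigned>& perm) {
  for (std::size_t i = 1; i < perm.size(); ++i)
    if (perm[i] != perm[0] + i) return false;
  return true;
}

// Hypothetical helper mirroring the quoted condition.
//   perm           - load permutation, one entry per lane of the load node
//   group_lanes    - number of lanes in the whole load group
//   ld_lanes_lanes - non-zero when load-lanes would be used
//   num_loads      - number of load nodes sharing the group
static bool keeps_load_unlowered(const std::vector<unsigned>& perm,
                                 unsigned group_lanes,
                                 unsigned ld_lanes_lanes,
                                 std::size_t num_loads) {
  const std::size_t lanes = perm.size();  // plays the role of SLP_TREE_LANES
  if (lanes == 0) return false;           // guard for the sketch only
  return ld_lanes_lanes == 0
         && is_contiguous(perm)
         && (lanes > 1 || num_loads == 1)
         && is_pow2(lanes)
         && perm[0] % lanes == 0          // chunk starts on an aligned lane
         && group_lanes % lanes == 0;     // chunk size divides the group
}
```

Under these assumptions, in a group of four lanes the permutations { 0, 1 } and { 2, 3 } are left alone, while { 1, 2 } (misaligned start) and { 0, 2 } (not contiguous) are not exempted.  Note that the predicate only looks at the permutation indices, so it has no notion of a lane being a gap, which is the "do-not-care" state mentioned above.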