https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Reducing the VF here should be the goal.  For the particular case "filling"
the holes with neutral data and blending in the original values at store time
will likely be optimal.  So do

  tem = vector load
  zero all [4] elements
  compute
  blend in 'tem' into the [4] elements
  vector store

eliding all the shuffling/striding.  Should end up at a VF of 4 (SSE) or 8
(AVX).

Doesn't fit very well into the current vectorizer architecture.  So currently
we can only address this from the costing side.

arm can probably leverage load/store-lanes here.

With char elements and an SLP size of 3 it's probably the worst case we
can think of.
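
For illustration only, the load / zero / compute / blend / store sequence from the comment might look roughly like the sketch below, written in plain C rather than as vectorizer output. The loop shape is an assumption (4-byte pixels where only the first three bytes are updated, giving an SLP group of 3 with a one-byte hole), as are the names halve_rgb_scalar and halve_rgb_blend and the ">> 1" computation; none of them come from the bug report. The 16-byte block stands in for one SSE vector, i.e. a VF of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { LANES = 16 };  /* 4 pixels x 4 bytes: one SSE-sized vector (assumed) */

/* Scalar reference: update bytes 0..2 of each pixel, leave byte 3 alone.
   Hypothetical loop shape, not the testcase from the bug report.  */
static void halve_rgb_scalar(unsigned char (*px)[4], size_t n)
{
    assert(px != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        for (int c = 0; c < 3; c++)
            px[i][c] = (unsigned char)(px[i][c] >> 1);
}

/* Same result, structured like the sequence in the comment: operate on
   whole 16-byte blocks, holes included, and restore the holes at store
   time instead of shuffling them out.  */
static void halve_rgb_blend(unsigned char (*px)[4], size_t n)
{
    assert(px != NULL || n == 0);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        unsigned char tem[LANES], work[LANES];

        memcpy(tem, px + i, LANES);            /* tem = vector load */

        for (int k = 0; k < LANES; k++)        /* fill holes with neutral 0 */
            work[k] = (k % 4 == 3) ? 0 : tem[k];

        for (int k = 0; k < LANES; k++)        /* compute on every lane */
            work[k] = (unsigned char)(work[k] >> 1);

        for (int k = 0; k < LANES; k++)        /* blend original hole bytes */
            if (k % 4 == 3)
                work[k] = tem[k];

        memcpy(px + i, work, LANES);           /* vector store */
    }
    halve_rgb_scalar(px + i, n - i);           /* scalar tail, < 4 pixels */
}
```

Each inner loop over k has no cross-lane dependence, so it corresponds to a single full-width vector operation (an AND with a mask, a shift, a blend); that is the point of keeping the holes in place rather than permuting them away.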