https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Reducing the VF here should be the goal.  For this particular case, "filling"
the holes with neutral data and blending in the original values at store time
will likely be optimal.  So do

  tem = vector load
  zero all [4] elements
  compute
  blend in 'tem' into the [4] elements
  vector store

eliding all the shuffling/striding.  Should end up at a VF of 4 (SSE) or 8
(AVX).
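
The load/zero/compute/blend/store sequence above can be emulated in plain C
to check that it matches a scalar loop.  This is only a sketch, not vectorizer
output.  It assumes groups of 4 chars of which the first 3 are computed and
the fourth is the hole (my reading of "the [4] elements").  The function
names, the x / 2 + 1 computation and the 16-byte block size are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: 4 chars per group, lanes 0..2 live, lane 3 is the hole.
   16 bytes stands in for one SSE register, i.e. 4 groups per block.  */
enum { GROUP = 4, VEC_BYTES = 16 };

/* Scalar reference: touches only lanes 0..2 of every group.  */
static void update_scalar(uint8_t *p, size_t ngroups)
{
    for (size_t i = 0; i < ngroups; i++)
        for (int c = 0; c < 3; c++)
            p[i * GROUP + c] = (uint8_t)(p[i * GROUP + c] / 2 + 1);
}

/* Fill-and-blend: one contiguous load and one contiguous store per block,
   no strided accesses or shuffles.  */
static void update_blend(uint8_t *p, size_t ngroups)
{
    /* The sketch has no epilogue, so it needs whole blocks.  */
    assert((ngroups * GROUP) % VEC_BYTES == 0);

    for (size_t off = 0; off < ngroups * GROUP; off += VEC_BYTES) {
        uint8_t tem[VEC_BYTES], v[VEC_BYTES];

        memcpy(tem, p + off, VEC_BYTES);        /* tem = vector load */
        memcpy(v, tem, VEC_BYTES);
        for (int k = 3; k < VEC_BYTES; k += GROUP)
            v[k] = 0;                           /* zero the hole lanes */
        for (int k = 0; k < VEC_BYTES; k++)
            v[k] = (uint8_t)(v[k] / 2 + 1);     /* compute on all lanes */
        for (int k = 3; k < VEC_BYTES; k += GROUP)
            v[k] = tem[k];                      /* blend 'tem' back in */
        memcpy(p + off, v, VEC_BYTES);          /* vector store */
    }
}
```

With 16-byte blocks each outer iteration covers 4 groups, which corresponds to
the VF of 4 mentioned for SSE.  A 32-byte block would give 8.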

Doesn't fit very well into the current vectorizer architecture.

So currently we can only address this from the costing side.

arm can probably leverage load/store-lanes here.

With char elements and an SLP size of 3 it's probably the worst case we can
think of.
