https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93771
--- Comment #4 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 17 Feb 2020, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93771
>
> --- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #2)
> > Confirmed.  I'm not sure if we should try to "fix" SLP here or rather
> > appropriately optimize
> >
> >   v2df tem1 = *(v2df *)&t[0];
> >   v2df tem2 = *(v2df *)&t[2];
> >   __builtin_shuffle (tem1, tem2, (v2di) { 0, 3 });
> >
> > which the user could write itself.  forwprop does some related transforms
> > splitting loads in "Rewrite loads used only in BIT_FIELD_REF extractions to
> > component-wise loads."
>
> I was thinking about originally filing the bug that way but I decided against
> it; though I don't remember my reasoning besides I saw SLP not doing it for
> unrelated loads.

The vectorizer sees the two loads from t as grouped at a time when it
doesn't yet know the vectorization factor.  General handling
of non-contiguous loads then emits the permutation.  The structure
of that code makes it quite hard to do what you desire, and
changing the decision of whether it's a group or not "late" is
also going to hurt.

There are pending changes (in my mind only ... :/) that would
make such a change much more straightforward, of course.

So I think an ad-hoc solution in forwprop is better for now.
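
For illustration only, here is a minimal sketch of the two equivalent forms under discussion, written with GCC's vector extensions. The function names pick_with_shuffle and pick_componentwise are hypothetical, and the loads use __builtin_memcpy rather than the pointer casts from the quoted snippet so that the sketch makes no alignment or aliasing assumptions. It is not the actual forwprop change, only the source-level shape of what such a transform would turn one form into.

```c
/* Sketch only: requires GCC (uses __builtin_shuffle and vector_size).  */
typedef double v2df __attribute__ ((vector_size (16)));
typedef long long v2di __attribute__ ((vector_size (16)));

/* Two full-width vector loads followed by a permutation.  With two
   two-element inputs, mask index 0 selects tem1[0] (t[0]) and mask
   index 3 selects tem2[1] (t[3]).  */
v2df
pick_with_shuffle (const double *t)
{
  v2df tem1, tem2;
  __builtin_memcpy (&tem1, &t[0], sizeof tem1);
  __builtin_memcpy (&tem2, &t[2], sizeof tem2);
  return __builtin_shuffle (tem1, tem2, (v2di) { 0, 3 });
}

/* The same result built from component-wise scalar loads, which only
   touches the two elements that are actually used.  */
v2df
pick_componentwise (const double *t)
{
  return (v2df) { t[0], t[3] };
}
```

Both functions should return the same vector for any four-element input; the second form avoids loading t[1] and t[2] and needs no permute instruction.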