https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438
--- Comment #12 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #11)
> (In reply to Maxim Kuvyrkov from comment #9)
> > which then becomes for aarch64:
> > .L4:
> >         ld2     {v0.2d - v1.2d}, [x1]
> >         add     w2, w2, 1
> >         cmp     w2, w7
> >         eor     v0.16b, v2.16b, v0.16b
> >         umov    x4, v0.d[1]
> >         st1     {v0.d}[0], [x1]
> >         add     x1, x1, 32
> >         str     x4, [x1, -16]
> >         bcc     .L4
> >
>
> What I did for thunderx was create a vector cost model which caused this
> loop not be vectorized to get the regression from happening. Not this might
> actually be better code for some micro arch. I need to check with the new
> processor we have in house but that is next week or so. I don't know how
> much I can share next week though.

You are making an orthogonal point to this bug report: whether or not to
vectorize such a loop. But if the loop is vectorized, then on any
microarchitecture it is better to have "st2" than "umov; st1; str".
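
For readers following along, here is a rough, portable C sketch of what the two store strategies amount to. It assumes, based only on the quoted assembly, that the loop XORs every other 64-bit element with a loop-invariant value; the actual testcase is not shown in this comment. The function names xor_even and xor_even_ld2_st2_model are hypothetical, and the second function is a scalar model of an "ld2; eor; st2" sequence, not compiler output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduction of the loop under discussion (an assumption):
   only the even-indexed 64-bit elements are updated. */
void xor_even(uint64_t *a, size_t n, uint64_t mask)
{
    assert(a != NULL);
    for (size_t i = 0; i < n; i++)
        a[2 * i] ^= mask;
}

/* Scalar model of "ld2; eor; st2": two iterations (four elements)
   are handled per step, and all four elements are stored back. */
void xor_even_ld2_st2_model(uint64_t *a, size_t n, uint64_t mask)
{
    assert(a != NULL);
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        uint64_t *p = a + 2 * i;
        /* ld2: de-interleave into two two-lane "registers". */
        uint64_t v0[2] = { p[0], p[2] };
        uint64_t v1[2] = { p[1], p[3] };
        /* eor: only the even-element register is modified. */
        v0[0] ^= mask;
        v0[1] ^= mask;
        /* st2: re-interleave and store both registers in one go. */
        p[0] = v0[0];
        p[1] = v1[0];
        p[2] = v0[1];
        p[3] = v1[1];
    }
    /* Scalar tail for an odd iteration count. */
    for (; i < n; i++)
        a[2 * i] ^= mask;
}
```

Note that in this model the odd elements are written back unchanged, whereas the "umov; st1; str" sequence stores only the two modified lanes individually.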