https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

--- Comment #12 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #11)
> (In reply to Maxim Kuvyrkov from comment #9)
> > which then becomes for aarch64:
> > .L4:
> >     ld2     {v0.2d - v1.2d}, [x1]
> >     add     w2, w2, 1
> >     cmp     w2, w7
> >     eor     v0.16b, v2.16b, v0.16b
> >     umov    x4, v0.d[1]
> >     st1     {v0.d}[0], [x1]
> >     add     x1, x1, 32
> >     str     x4, [x1, -16]
> >     bcc     .L4
> 
> 
> What I did for thunderx was create a vector cost model that keeps this
> loop from being vectorized, which avoids the regression.  Note this might
> actually be better code for some microarchitectures.  I need to check with
> the new processor we have in house but that is next week or so.  I don't
> know how much I can share next week though.

You are making a point orthogonal to this bug report: whether or not to
vectorize such a loop.  But if the loop is vectorized, then on any
microarchitecture a single "st2" is better than "umov; st1; str".
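
For illustration only, here is a rough scalar model of the two store
strategies in portable C.  The loop shape (XOR of every other 64-bit
element with a loop-invariant mask) is inferred from the quoted assembly
and is an assumption, not the testcase from this report; all function
names are hypothetical.  The "st2" variant writes the unchanged odd
elements back along with the even ones, which is what an interleaving
store of both loaded registers would do.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed source loop: XOR every other 64-bit element with a mask. */
void xor_even_scalar(uint64_t *a, size_t pairs, uint64_t mask)
{
    for (size_t i = 0; i < pairs; i++)
        a[2 * i] ^= mask;
}

/* Models "ld2; eor; umov; st1; str": the two even lanes are stored
   separately after the de-interleaving load. */
void xor_even_lane_stores(uint64_t *a, size_t pairs, uint64_t mask)
{
    size_t i = 0;
    for (; i + 2 <= pairs; i += 2) {
        uint64_t *p = a + 2 * i;
        uint64_t v0[2] = { p[0], p[2] };  /* ld2: even elements */
        v0[0] ^= mask;                    /* eor */
        v0[1] ^= mask;
        p[0] = v0[0];                     /* st1 {v0.d}[0] */
        p[2] = v0[1];                     /* umov + str */
    }
    for (; i < pairs; i++)                /* scalar tail */
        a[2 * i] ^= mask;
}

/* Models "ld2; eor; st2": both registers are written back with one
   interleaving store, so the odd elements are rewritten unchanged. */
void xor_even_st2(uint64_t *a, size_t pairs, uint64_t mask)
{
    size_t i = 0;
    for (; i + 2 <= pairs; i += 2) {
        uint64_t *p = a + 2 * i;
        uint64_t v0[2] = { p[0], p[2] };  /* ld2: even elements */
        uint64_t v1[2] = { p[1], p[3] };  /* ld2: odd elements */
        v0[0] ^= mask;                    /* eor */
        v0[1] ^= mask;
        p[0] = v0[0];                     /* st2: interleave v0 and v1 */
        p[1] = v1[0];
        p[2] = v0[1];
        p[3] = v1[1];
    }
    for (; i < pairs; i++)                /* scalar tail */
        a[2 * i] ^= mask;
}
```

All three functions leave the array in the same state; the last one
replaces the lane extract and two partial stores with a single store per
iteration.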
