https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110023

--- Comment #2 from d_vampile <d_vampile at 163 dot com> ---
(In reply to Andrew Pinski from comment #1)
> This is almost definitely an aarch64 cost model issue ...

Do you mean that, after r247544, the vectorizer cost model for the underlying
hardware causes the strategy of not peeling the loop to be chosen? If so, why
does loop peeling result in a performance improvement?
As I understand it, the following code is a very standard loop that vectorizes
effectively:
for (j=0; j<STREAM_ARRAY_SIZE; j++)
    c[j] = a[j]+b[j];
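
For reference, here is a minimal hand-written sketch of what peeling for
alignment means for a loop of this shape. It is only an illustration, not the
code GCC generates before or after r247544. The function name add_peeled, the
VEC_BYTES value of 16, and the use of double elements are all assumptions made
for the example. The main loop handles two elements per step as a stand-in for
one vector operation.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed vector width in bytes (one 128-bit register); illustrative only. */
#define VEC_BYTES 16

/* Hypothetical helper: computes c[j] = a[j] + b[j] with a manual peel. */
void add_peeled(double *c, const double *a, const double *b, size_t n)
{
    size_t j = 0;

    /* Prologue (the peeled iterations): scalar steps until the store
       address &c[j] is aligned to VEC_BYTES. */
    while (j < n && ((uintptr_t)&c[j] % VEC_BYTES) != 0) {
        c[j] = a[j] + b[j];
        j++;
    }

    /* Main loop: two doubles per step, standing in for one vector add
       whose store is now aligned. */
    for (; j + 2 <= n; j += 2) {
        c[j]     = a[j]     + b[j];
        c[j + 1] = a[j + 1] + b[j + 1];
    }

    /* Epilogue: leftover elements that do not fill a whole vector. */
    for (; j < n; j++)
        c[j] = a[j] + b[j];
}
```

Whether the extra prologue pays off depends on how much more an unaligned
vector access costs than an aligned one on the target, and that trade-off is
what a vectorizer cost model has to estimate.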
