https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110023
--- Comment #2 from d_vampile <d_vampile at 163 dot com> ---
(In reply to Andrew Pinski from comment #1)
> This is almost definitely an aarch64 cost model issue ...

Do you mean that the vectorizer cost model for the underlying hardware is what causes the no-peeling strategy to be chosen after r247544? If so, why does loop peeling result in a performance improvement? As I understand it, the following code is a very standard loop that vectorizes well:

for (j=0; j<STREAM_ARRAY_SIZE; j++)
    c[j] = a[j]+b[j];
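
To make clear what I mean by "peeling" here, below is a hand-written sketch of peeling for alignment applied to the same loop. It is only an illustration, not the code GCC generates. The element type (double), the 16-byte vector width, and the names add_simple, add_peeled and peel_count are my assumptions and do not come from the benchmark or from GCC.

```c
#include <stddef.h>
#include <stdint.h>

/* The loop from the report, written as a function.
   Assumption: the arrays hold doubles. */
void add_simple(double *c, const double *a, const double *b, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j] + b[j];
}

/* Hypothetical helper: number of leading scalar iterations to run
   before the store address c + k reaches a 16-byte boundary.
   Assumption: 16-byte vectors, i.e. two doubles per vector. */
size_t peel_count(const double *c, size_t n)
{
    size_t misalign = (size_t)((uintptr_t)c % 16);
    size_t k = misalign ? (16 - misalign) / sizeof(double) : 0;
    return k < n ? k : n;
}

/* Same computation, split by hand into prologue, main loop and epilogue. */
void add_peeled(double *c, const double *a, const double *b, size_t n)
{
    size_t j = 0;
    size_t peel = peel_count(c, n);

    /* Prologue: the peeled scalar iterations. */
    for (; j < peel; j++)
        c[j] = a[j] + b[j];

    /* Main loop: two elements per iteration, standing in for one
       16-byte vector operation with an aligned store to c. */
    for (; j + 2 <= n; j += 2) {
        c[j]     = a[j]     + b[j];
        c[j + 1] = a[j + 1] + b[j + 1];
    }

    /* Epilogue: at most one leftover element. */
    for (; j < n; j++)
        c[j] = a[j] + b[j];
}
```

Both functions compute the same result. The only difference is that in add_peeled the main loop always starts with c on a 16-byte boundary, at the price of an extra prologue and epilogue. Whether that trade-off pays off is, as far as I understand, exactly what the cost model has to decide.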