https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438
--- Comment #14 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Maxim Kuvyrkov from comment #12)
> You are making an orthogonal point to this bug report: whether or not to
> vectorize such a loop. But if loop is vectorized, then on any
> microarchitecture it is better to have "st2" vs "umov; st1; str".

Yes, but thinking about the problem some more, I do think there are some
vector cost model issues in the aarch64 backend, where we don't model int vs
floating point cost differences. For example, ^ for scalar int might be one
cycle while the vector version is 4 cycles, but floating point scalar addition
is 4 cycles and the floating point vector addition is also just 4 cycles.

struct cpu_vector_cost
{
  const int scalar_stmt_cost;  /* Cost of any scalar operation,
                                  excluding load and store.  */
  ...
  const int vec_stmt_cost;     /* Cost of any vector operation,
                                  excluding load, store, permute,
                                  vector-to-scalar and
                                  scalar-to-vector operation.  */

Anyways I filed PR 79262 for the regression.
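To make the int vs floating point point concrete, here is a minimal sketch. It is not GCC code: the struct `split_vector_cost`, its fields, and the helper functions are hypothetical names invented for illustration, and the numbers are just the cycle counts from the example above. It also assumes a deliberately crude model in which one vector statement replaces VF scalar statements.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: NOT the real aarch64 cpu_vector_cost layout.  It only
   splits the single scalar/vector statement costs by int vs FP.  */
struct split_vector_cost
{
  int scalar_int_stmt_cost;
  int scalar_fp_stmt_cost;
  int vec_int_stmt_cost;
  int vec_fp_stmt_cost;
};

/* Illustrative cycle counts taken from the example in the comment:
   scalar int 1, scalar FP 4, vector int 4, vector FP 4.  */
static const struct split_vector_cost example_cost = { 1, 4, 4, 4 };

/* Cost of executing one statement VF times in scalar form.  */
static int
scalar_cost (const struct split_vector_cost *c, bool is_fp, int vf)
{
  assert (vf > 0);
  return vf * (is_fp ? c->scalar_fp_stmt_cost : c->scalar_int_stmt_cost);
}

/* Cost of the single vector statement that replaces them.  */
static int
vector_cost (const struct split_vector_cost *c, bool is_fp)
{
  return is_fp ? c->vec_fp_stmt_cost : c->vec_int_stmt_cost;
}

/* True if the vector statement is strictly cheaper than VF scalar ones.  */
static bool
vector_stmt_profitable (const struct split_vector_cost *c, bool is_fp, int vf)
{
  return vector_cost (c, is_fp) < scalar_cost (c, is_fp, vf);
}
```

With these numbers and a vectorization factor of 2, the int case comes out at 2 (scalar) vs 4 (vector), while the floating point case is 8 (scalar) vs 4 (vector). A single scalar_stmt_cost/vec_stmt_cost pair cannot express both outcomes.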