https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625
--- Comment #6 from Hao Liu <hliu at amperecomputing dot com> ---
Thanks for confirming the reduction latency. I'll create a simple patch to fix
this.

> Discounting the loads, we do have 15 general operations.

That's true, and there are indeed 8 general operations for the scalar loop. Since
count_ops() is accurate, it seems the cost of the vector body may be too large
(Vector inside of loop cost: 51):

  *k_48 4 times vec_perm costs 12 in body
  *k_48 1 times unaligned_load (misalign -1) costs 4 in body
  _5->m1 1 times vec_perm costs 3 in body
  _5->m4 1 times unaligned_load (misalign -1) costs 4 in body
  (int) _24 2 times vec_promote_demote costs 4 in body
  (double) _25 4 times vec_promote_demote costs 8 in body
  _2 * _26 4 times vector_stmt costs 8 in body

If that cost were small enough, SLP would still be profitable even after the
vect-body cost is increased according to the issue-info.

I'm not very familiar with this part, and it may affect all aarch64 targets, so
I think it would be hard for me to fix. It would be great if you could look at
how to fix this.
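
For readers who want to check the arithmetic, here is a minimal sketch that tallies the cost lines quoted above. The regular expression and the names `tally`, `DUMP` and `REPORTED_TOTAL` are illustrative assumptions, not part of GCC, and the line format is inferred only from the seven lines in this comment. The quoted lines sum to 43, so 8 of the reported 51 come from entries not quoted here. The non-load statement counts add up to 15, which appears to correspond to the "15 general operations" figure.

```python
import re
from collections import defaultdict

# Cost lines copied verbatim from the comment above.
DUMP = """\
*k_48 4 times vec_perm costs 12 in body
*k_48 1 times unaligned_load (misalign -1) costs 4 in body
_5->m1 1 times vec_perm costs 3 in body
_5->m4 1 times unaligned_load (misalign -1) costs 4 in body
(int) _24 2 times vec_promote_demote costs 4 in body
(double) _25 4 times vec_promote_demote costs 8 in body
_2 * _26 4 times vector_stmt costs 8 in body
"""

# "Vector inside of loop cost" as reported in the comment.
REPORTED_TOTAL = 51

# Assumed line shape: <stmt> <count> times <kind> [(misalign N)] costs <cost> in <where>
LINE_RE = re.compile(
    r"^(?P<stmt>.+?) (?P<count>\d+) times (?P<kind>\w+)"
    r"(?: \(misalign -?\d+\))? costs (?P<cost>\d+) in (?P<where>\w+)$"
)


def tally(dump):
    """Return (total body cost, cost per kind, number of non-load statements)."""
    total = 0
    by_kind = defaultdict(int)
    general_ops = 0
    for line in dump.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None or match["where"] != "body":
            continue  # skip anything that is not a body cost line
        cost = int(match["cost"])
        total += cost
        by_kind[match["kind"]] += cost
        if "load" not in match["kind"]:
            general_ops += int(match["count"])
    return total, dict(by_kind), general_ops


if __name__ == "__main__":
    total, by_kind, general_ops = tally(DUMP)
    print("quoted body cost:", total)
    print("not quoted here:", REPORTED_TOTAL - total)
    print("cost by kind:", by_kind)
    print("non-load statements:", general_ops)
```

On these lines, vec_perm (15) and vec_promote_demote (12) together make up 27 of the 43 quoted cost units, so those two kinds are the natural first place to look when asking whether the body cost is too high.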