https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #6 from Hao Liu <hliu at amperecomputing dot com> ---
Thanks for the confirmation about the reduction latency.  I'll create a simple
patch to fix this.

> Discounting the loads, we do have 15 general operations.

That's true, and there are indeed 8 general operations for the scalar loop.  As
count_ops() is accurate, it seems the cost of the vector body may be too large
(Vector inside of loop cost: 51):

    *k_48 4 times vec_perm costs 12 in body
    *k_48 1 times unaligned_load (misalign -1) costs 4 in body
    _5->m1 1 times vec_perm costs 3 in body
    _5->m4 1 times unaligned_load (misalign -1) costs 4 in body
    (int) _24 2 times vec_promote_demote costs 4 in body
    (double) _25 4 times vec_promote_demote costs 8 in body
    _2 * _26 4 times vector_stmt costs 8 in body

If that cost were small enough, SLP would still be profitable even if the
vect-body cost is increased according to the issue info.  I'm not very familiar
with this part, and it may affect all aarch64 targets, so I think it's hard for
me to fix.  It would be great if you could look at how to fix this.
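
For anyone who wants to see where the body cost goes, here is a small Python
sketch that tallies the dump lines quoted above by cost kind.  It is not part
of GCC; the regular expression and the tally() helper are assumptions written
only to match the line format shown in this comment.

```python
import re

# Lines copied verbatim from the vectorizer cost dump quoted in this comment.
DUMP = """\
*k_48 4 times vec_perm costs 12 in body
*k_48 1 times unaligned_load (misalign -1) costs 4 in body
_5->m1 1 times vec_perm costs 3 in body
_5->m4 1 times unaligned_load (misalign -1) costs 4 in body
(int) _24 2 times vec_promote_demote costs 4 in body
(double) _25 4 times vec_promote_demote costs 8 in body
_2 * _26 4 times vector_stmt costs 8 in body
"""

# Hypothetical pattern for "<stmt> <N> times <kind> [(...)] costs <C> in body".
LINE_RE = re.compile(
    r"^(?P<stmt>.+?) (?P<count>\d+) times (?P<kind>\w+)"
    r"(?: \(.*?\))? costs (?P<cost>\d+) in body$"
)


def tally(dump):
    """Return a dict mapping each cost kind to its summed body cost."""
    per_kind = {}
    for line in dump.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a cost line
        per_kind[m["kind"]] = per_kind.get(m["kind"], 0) + int(m["cost"])
    return per_kind


per_kind = tally(DUMP)
total = sum(per_kind.values())

for kind, cost in sorted(per_kind.items(), key=lambda kv: -kv[1]):
    print(f"{kind:20s} {cost}")
print(f"{'total (quoted lines)':20s} {total}")
```

On the quoted lines this gives vec_perm 15, vec_promote_demote 12,
unaligned_load 8 and vector_stmt 8, for a total of 43.  That is less than the
reported 51, so the remaining 8 is not accounted for by the entries quoted
here.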
