https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625
rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #4 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Sorry, didn't see this PR until now.

On:

> general operations = 15 <-- Too large

Are you sure this is too large?  The vector code seems to be:

        ldr     q31, [x3], 16
        ldr     q29, [x4], -16
        rev64   v31.8h, v31.8h
        uxtl    v30.4s, v31.4h
        uxtl2   v31.4s, v31.8h
        sxtl    v27.2d, v30.2s
        sxtl    v28.2d, v31.2s
        sxtl2   v30.2d, v30.4s
        sxtl2   v31.2d, v31.4s
        scvtf   v27.2d, v27.2d
        scvtf   v28.2d, v28.2d
        scvtf   v30.2d, v30.2d
        scvtf   v31.2d, v31.2d
        fmla    v26.2d, v27.2d, v29.d[1]
        fmla    v24.2d, v30.2d, v29.d[1]
        fmla    v23.2d, v28.2d, v29.d[0]
        fmla    v25.2d, v31.2d, v29.d[0]

Discounting the loads, we do have 15 general operations.

On the reduction latency, the:

>   /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
>      that's not yet the case.  */

is referring to the single_defuse_cycle code in vectorizable_reduction.
That's always seemed like a misfeature to me, since it serialises
a multi-vector reduction through a single accumulator.  I guess it's
finally time to opt out of that for aarch64.

If we did opt out, then removing the “* count” should be correct for
all cases.
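
To illustrate what serialising a reduction through a single accumulator
means for latency, here is a rough scalar C sketch.  It is not GCC code
and not the testcase from this PR: the function names dot_single_acc and
dot_four_acc, the element types and the factor of four are all assumptions
made for the example.  The first function has one dependency chain, so
every add waits for the previous one; the second keeps four independent
chains and combines them only after the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: one accumulator.  Each multiply-add depends on
   the result of the previous iteration, so the adds form one serial
   chain of length n.  */
double dot_single_acc(const uint16_t *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += (double)a[i] * b[i];
    return acc;
}

/* Hypothetical example: four independent accumulators.  Within one
   iteration the four multiply-adds do not depend on each other, so each
   chain has length n / 4.  Assumes n is a multiple of 4.  */
double dot_four_acc(const uint16_t *a, const double *b, size_t n)
{
    assert(n % 4 == 0);
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        acc0 += (double)a[i]     * b[i];
        acc1 += (double)a[i + 1] * b[i + 1];
        acc2 += (double)a[i + 2] * b[i + 2];
        acc3 += (double)a[i + 3] * b[i + 3];
    }
    /* Combine the partial sums once, after the loop.  */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

The two functions add the products in a different order, so for general
floating-point inputs the results can differ in the last bits.  That is
why a compiler needs something like -fassociative-math before it may
reassociate a floating-point reduction in this way.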