https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #4 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Sorry, didn't see this PR until now.

On:

>       general operations = 15   <-- Too large

Are you sure this is too large?  The vector code seems to be:

        ldr     q31, [x3], 16
        ldr     q29, [x4], -16
        rev64   v31.8h, v31.8h
        uxtl    v30.4s, v31.4h
        uxtl2   v31.4s, v31.8h
        sxtl    v27.2d, v30.2s
        sxtl    v28.2d, v31.2s
        sxtl2   v30.2d, v30.4s
        sxtl2   v31.2d, v31.4s
        scvtf   v27.2d, v27.2d
        scvtf   v28.2d, v28.2d
        scvtf   v30.2d, v30.2d
        scvtf   v31.2d, v31.2d
        fmla    v26.2d, v27.2d, v29.d[1]
        fmla    v24.2d, v30.2d, v29.d[1]
        fmla    v23.2d, v28.2d, v29.d[0]
        fmla    v25.2d, v31.2d, v29.d[0]

Discounting the two loads, we do have 15 general operations: one rev64,
two uxtl/uxtl2, four sxtl/sxtl2, four scvtf and four fmla.

On the reduction latency, the:

>      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
>        that's not yet the case.  */

is referring to the single_defuse_cycle code in vectorizable_reduction.  That's
always seemed like a misfeature to me, since it serialises a multi-vector
reduction through a single accumulator.  I guess it's finally time to opt out
of that for aarch64.

If we did opt out, then removing the “* count” should be correct for all cases.
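
To illustrate the latency point, here is a rough scalar analogy in C.  It is
not GCC code and not what vectorizable_reduction emits; the function names
sum_single_chain and sum_parallel_chains are made up for this sketch.  The
first routine funnels four adds per iteration through one accumulator, so the
dependency chain grows by four adds per iteration (the "* count" case).  The
second keeps four independent accumulators, so each chain grows by one add
per iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical analogy: one accumulator.  Each add depends on the previous
   one, so the four adds per iteration form a single serial chain.  */
double sum_single_chain(const double *x, size_t n)
{
    assert(n % 4 == 0);  /* Keep the sketch simple: no tail loop.  */
    double acc = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        acc += x[i];
        acc += x[i + 1];
        acc += x[i + 2];
        acc += x[i + 3];
    }
    return acc;
}

/* Hypothetical analogy: four accumulators.  The four adds per iteration are
   independent of each other, so each chain sees one add per iteration.  */
double sum_parallel_chains(const double *x, size_t n)
{
    assert(n % 4 == 0);  /* Same simplification as above.  */
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        acc0 += x[i];
        acc1 += x[i + 1];
        acc2 += x[i + 2];
        acc3 += x[i + 3];
    }
    /* Combine the partial sums once, after the loop.  */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

The two routines add the elements in a different order, so for general
floating-point input they can round differently; the sketch is only meant to
show the shape of the dependency chains.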
