https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583

--- Comment #30 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <tnfch...@gcc.gnu.org>:

https://gcc.gnu.org/g:f23dc726875c26f2c38dfded453aa9beba0b9be9

commit r13-6620-gf23dc726875c26f2c38dfded453aa9beba0b9be9
Author: Tamar Christina <tamar.christ...@arm.com>
Date:   Sun Mar 12 18:42:59 2023 +0000

    AArch64: Update div-bitmask to implement new optab instead of target hook
    [PR108583]

    This replaces the custom division hook with just an implementation through
    add_highpart.  For NEON we implement the add highpart (addition plus
    extraction of the upper half of each element, kept in the same precision)
    as ADD + LSR.

    This representation allows us to easily optimize the sequence using existing
    patterns, and it gets us a pretty decent sequence using SRA:

            umull   v1.8h, v0.8b, v3.8b
            umull2  v0.8h, v0.16b, v3.16b
            add     v5.8h, v1.8h, v2.8h
            add     v4.8h, v0.8h, v2.8h
            usra    v1.8h, v5.8h, 8
            usra    v0.8h, v4.8h, 8
            uzp2    v1.16b, v1.16b, v0.16b
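
    For reference, the scalar pattern being vectorized is the usual
    divide-by-0xff idiom; the sketch below is only illustrative (name and
    exact form assumed, not taken from the PR testcase).  The widening
    multiply becomes the umull/umull2 pair and the division by 0xff becomes
    the add-highpart sequence above:

            #include <stdint.h>

            /* Illustrative only: scale each byte by level/0xff.  The product
               is a 16-bit value divided by 0xff, i.e. the div-bitmask case
               handled here.  */
            void
            scale_bytes (uint8_t *restrict p, uint8_t level, int n)
            {
              for (int i = 0; i < n; i++)
                p[i] = (p[i] * level) / 0xff;
            }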

    To get the optimal sequence, however, we match (a + ((b + c) >> n)), where n
    is half the precision of the mode of the operation, into addhn + uaddw, which
    is a generally good optimization on its own and gets us back to:

    .L4:
            ldr     q0, [x3]
            umull   v1.8h, v0.8b, v5.8b
            umull2  v0.8h, v0.16b, v5.16b
            addhn   v3.8b, v1.8h, v4.8h
            addhn   v2.8b, v0.8h, v4.8h
            uaddw   v1.8h, v1.8h, v3.8b
            uaddw   v0.8h, v0.8h, v2.8b
            uzp2    v1.16b, v1.16b, v0.16b
            str     q1, [x3], 16
            cmp     x3, x4
            bne     .L4
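
    Expressed with ACLE NEON intrinsics the rewrite is roughly the following
    (an illustrative sketch, not code from the patch).  Since (b + c) is
    computed modulo 2^16, its top half always fits in 8 bits, so addhn
    computes the shifted value exactly and uaddw widens it back for the add:

            #include <arm_neon.h>

            /* Illustrative: (a + ((b + c) >> 8)) on 16-bit lanes rewritten
               as ADDHN + UADDW.  */
            uint16x8_t
            add_shift_rewrite (uint16x8_t a, uint16x8_t b, uint16x8_t c)
            {
              uint8x8_t hi = vaddhn_u16 (b, c); /* (b + c) >> 8, narrowed.  */
              return vaddw_u8 (a, hi);          /* a + zero-extended hi.  */
            }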

    For SVE2 we optimize the initial sequence to the same ADD + LSR, which gets
    us:

    .L3:
            ld1b    z0.h, p0/z, [x0, x3]
            mul     z0.h, p1/m, z0.h, z2.h
            add     z1.h, z0.h, z3.h
            usra    z0.h, z1.h, #8
            lsr     z0.h, z0.h, #8
            st1b    z0.h, p0, [x0, x3]
            inch    x3
            whilelo p0.h, w3, w2
            b.any   .L3
    .L1:
            ret
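
    The arithmetic behind this shape is easy to check: the loop computes
    (x + ((x + C) >> 8)) >> 8 on each 16-bit product x, and with C = 257
    (an illustrative choice; the constant actually loaded into z3 is not
    visible in the snippet above) this equals x / 0xff exactly for every
    product of two bytes:

            #include <stdint.h>
            #include <assert.h>

            /* Exhaustive check of (x + ((x + 257) >> 8)) >> 8 == x / 0xff
               for all products of two unsigned bytes.  */
            int
            main (void)
            {
              for (uint32_t a = 0; a <= 0xff; a++)
                for (uint32_t b = 0; b <= 0xff; b++)
                  {
                    uint16_t x = a * b;            /* mul      */
                    uint16_t t = x + 257;          /* add      */
                    uint16_t r = x + (t >> 8);     /* usra #8  */
                    assert ((r >> 8) == x / 0xff); /* lsr #8   */
                  }
              return 0;
            }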

    And to get the optimal sequence I match (a + b) >> n (with the same
    constraint on n) to addhnb, which gets us to:

    .L3:
            ld1b    z0.h, p0/z, [x0, x3]
            mul     z0.h, p1/m, z0.h, z2.h
            addhnb  z1.b, z0.h, z3.h
            addhnb  z0.b, z0.h, z1.h
            st1b    z0.h, p0, [x0, x3]
            inch    x3
            whilelo p0.h, w3, w2
            b.any   .L3
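
    Written with SVE2 ACLE intrinsics the same computation looks roughly
    like this (illustrative sketch, again assuming the 257 constant; needs
    +sve2).  addhnb writes the narrowed result into the even (bottom) byte
    lanes and zeroes the odd ones, so reinterpreted as 16-bit lanes the
    quotient is already zero-extended and ready for the st1b store:

            #include <arm_sve.h>

            /* Illustrative: x / 0xff on 16-bit lanes via two ADDHNB.  */
            svuint16_t
            div_0xff_u16 (svuint16_t x)
            {
              /* t = (x + 257) >> 8, narrowed to the bottom byte lanes.  */
              svuint8_t t = svaddhnb_u16 (x, svdup_n_u16 (257));
              /* q = (x + t) >> 8, narrowed again; odd byte lanes are zero.  */
              svuint8_t q = svaddhnb_u16 (x, svreinterpret_u16_u8 (t));
              return svreinterpret_u16_u8 (q);
            }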

    There are multiple possible RTL representations for these optimizations.  I
    did not represent them using a zero_extend because we seem very inconsistent
    about this in the backend.  Since they are unspecs we won't match them from
    vector ops anyway.  I figured maintainers would prefer this, but my maintainer
    ouija board is still out for repairs :)

    There are no new tests, as new correctness tests were added to the mid-end
    and codegen tests for this already exist.

    gcc/ChangeLog:

            PR target/108583
            * config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3):
            Remove.
            (*bitmask_shift_plus<mode>): New.
            * config/aarch64/aarch64-sve2.md (*bitmask_shift_plus<mode>): New.
            (@aarch64_bitmask_udiv<mode>3): Remove.
            * config/aarch64/aarch64.cc
            (aarch64_vectorize_can_special_div_by_constant,
            TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
            (TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
            aarch64_vectorize_preferred_div_as_shifts_over_mult): New.
