https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Peter Cordes from comment #1)
> On AArch64 (with gcc8.2), we see a similar effect, more instructions in the
> loop.  And an indexed addressing mode.

With the trunk (with generic tuning):
.L4:
        ldr     q1, [x3, x2]
        ldr     q0, [x4]
        tbl     v1.16b, {v1.16b}, v2.16b
        tbl     v0.16b, {v0.16b}, v2.16b
        str     q1, [x4], 16
        str     q0, [x3, x2]
        sub     x2, x2, #16
        cmp     x2, x1

With -mcpu=octeontx:
.L6:
        ldr     q1, [x0, x2]
        ldr     q0, [x3, x1]
        tbl     v1.16b, {v1.16b}, v2.16b
        tbl     v0.16b, {v0.16b}, v2.16b
        str     q1, [x3, x1]
        add     x1, x1, 16
        str     q0, [x0, x2]
        sub     x2, x2, #16
        cmp     x1, x5
        bne     .L6

With -mcpu=thunderx2t99:
.L4:
        ldr     q1, [x3], -16
        ldr     q0, [x2]
        tbl     v1.16b, {v1.16b}, v2.16b
        tbl     v0.16b, {v0.16b}, v2.16b
        str     q1, [x2], 16
        str     q0, [x1], -16
        cmp     x2, x5
        bne     .L4

I am not shocked that IV-OPTS can choose such wildly different sets of IVs.
I have not looked at the cost differences to understand why -mcpu=thunderx2t99
chose what might be close to the best (we could use one fewer IV by having the
first ldr use the same IV as the last str).
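
For illustration only, here is a scalar C sketch of the two induction-variable
shapes seen in the listings above. It assumes, from the load/tbl/store pattern,
that the loop is an in-place reversal that swaps from both ends; the actual
test case is not quoted in this comment, and the function names reverse_ptr and
reverse_idx are hypothetical. The real loops work on 16-byte vectors; this only
mirrors how the addresses are formed, and a compiler is free to pick different
IVs for either form.

```c
#include <stddef.h>

/* Hypothetical name. Pointer-bumping form: one IV walks up from the front,
   one walks down from the back. The back pointer serves both the load and
   the store, which is the "one fewer IV" shape suggested above. */
void reverse_ptr(unsigned char *buf, size_t n)
{
    if (n < 2)
        return;                      /* nothing to swap */
    unsigned char *front = buf;
    unsigned char *back = buf + n - 1;
    while (front < back) {
        unsigned char t = *front;
        *front++ = *back;            /* post-increment, like "str q1, [x2], 16" */
        *back-- = t;                 /* post-decrement, like "str q0, [x1], -16" */
    }
}

/* Hypothetical name. Indexed form: a fixed base plus two offsets, closer to
   the "[x0, x2]" / "[x3, x1]" addressing in the octeontx listing. */
void reverse_idx(unsigned char *buf, size_t n)
{
    size_t j = n;                    /* offset from the base, counts down */
    for (size_t i = 0; i < n / 2; i++) {
        j--;
        unsigned char t = buf[i];
        buf[i] = buf[j];
        buf[j] = t;
    }
}
```

Both functions compute the same result; they differ only in which values are
kept live as loop IVs, which is the choice IV-OPTS makes from the cost model.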
