[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)

pinskia at gcc dot gnu.org Sun, 27 Oct 2019 17:12:49 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244


Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Peter Cordes from comment #1)
> On AArch64 (with gcc8.2), we see a similar effect, more instructions in the
> loop.  And an indexed addressing mode.

With the trunk (with generic tuning):
.L4:
        ldr     q1, [x3, x2]
        ldr     q0, [x4]
        tbl     v1.16b, {v1.16b}, v2.16b
        tbl     v0.16b, {v0.16b}, v2.16b
        str     q1, [x4], 16
        str     q0, [x3, x2]
        sub     x2, x2, #16
        cmp     x2, x1

With -mcpu=octeontx:
.L6:
        ldr     q1, [x0, x2]
        ldr     q0, [x3, x1]
        tbl     v1.16b, {v1.16b}, v2.16b
        tbl     v0.16b, {v0.16b}, v2.16b
        str     q1, [x3, x1]
        add     x1, x1, 16
        str     q0, [x0, x2]
        sub     x2, x2, #16
        cmp     x1, x5
        bne     .L6

With -mcpu=thunderx2t99:
.L4:
        ldr     q1, [x3], -16
        ldr     q0, [x2]
        tbl     v1.16b, {v1.16b}, v2.16b
        tbl     v0.16b, {v0.16b}, v2.16b
        str     q1, [x2], 16
        str     q0, [x1], -16
        cmp     x2, x5
        bne     .L4

I am not shocked that IV-OPTS can choose such wildly different IVs.
I have not looked at the cost differences to understand why -mcpu=thunderx2t99
chose what might be close to the best (we could use one fewer IV by having the
first ldr use the same IV as the last str).
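The following is a minimal portable C sketch of the loop shape the -mcpu=thunderx2t99 listing appears to implement. It assumes the original testcase reverses a byte buffer in place and that v2 holds the byte-reversal index {15, 14, ..., 0}, so each tbl reverses one 16-byte chunk. The names reverse16 and reverse_inplace are hypothetical, and the instruction mapping in the comments is an interpretation of the listing, not GCC output. It keeps only two pointers, lo and hi: the high-end pointer serves both the first load and the last store, which is the one-IV saving described above (in the listing, x3 and x1 play that role as two copies).

```c
#include <stddef.h>
#include <string.h>

/* Reverse one 16-byte chunk in place. Stands in for
   tbl v.16b, {v.16b}, v2.16b, ASSUMING v2 = {15, 14, ..., 0}. */
static void reverse16(unsigned char *v) {
    for (int i = 0; i < 8; i++) {
        unsigned char t = v[i];
        v[i] = v[15 - i];
        v[15 - i] = t;
    }
}

/* Hypothetical in-place reversal of buf[0..n) using only two pointer IVs.
   Assumes n is a multiple of 16; chunks are swapped from both ends until
   the pointers cross in the middle. */
void reverse_inplace(unsigned char *buf, size_t n) {
    if (n < 16)
        return;                          /* nothing to do for this sketch */
    unsigned char *lo = buf;             /* like x2: walks up by 16 */
    unsigned char *hi = buf + n - 16;    /* one pointer for x3 and x1 */
    while (lo < hi) {
        unsigned char a[16], b[16];
        memcpy(a, hi, 16);   /* ldr q1, [x3], -16 : load high chunk */
        memcpy(b, lo, 16);   /* ldr q0, [x2]      : load low chunk  */
        reverse16(a);        /* tbl v1.16b, ...   */
        reverse16(b);        /* tbl v0.16b, ...   */
        memcpy(lo, a, 16);   /* str q1, [x2], 16  */
        memcpy(hi, b, 16);   /* str q0, [x1], -16 : same address as first load */
        lo += 16;
        hi -= 16;
    }
    if (lo == hi)            /* odd number of chunks: reverse the middle one */
        reverse16(lo);
}
```

Because the loop only ever touches the chunks at lo and hi, a single hi pointer is enough for both the high-end load and store; a compiler is free to keep a second copy of it, as the listings show, but it is not needed for correctness.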
