https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244
Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Peter Cordes from comment #1)
> On AArch64 (with gcc8.2), we see a similar effect, more instructions in the
> loop. And an indexed addressing mode.

With the trunk (with generic tuning):
.L4:
        ldr     q1, [x3, x2]
        ldr     q0, [x4]
        tbl     v1.16b, {v1.16b}, v2.16b
        tbl     v0.16b, {v0.16b}, v2.16b
        str     q1, [x4], 16
        str     q0, [x3, x2]
        sub     x2, x2, #16
        cmp     x2, x1

With -mcpu=octeontx:
.L6:
        ldr     q1, [x0, x2]
        ldr     q0, [x3, x1]
        tbl     v1.16b, {v1.16b}, v2.16b
        tbl     v0.16b, {v0.16b}, v2.16b
        str     q1, [x3, x1]
        add     x1, x1, 16
        str     q0, [x0, x2]
        sub     x2, x2, #16
        cmp     x1, x5
        bne     .L6

With -mcpu=thunderx2t99:
.L4:
        ldr     q1, [x3], -16
        ldr     q0, [x2]
        tbl     v1.16b, {v1.16b}, v2.16b
        tbl     v0.16b, {v0.16b}, v2.16b
        str     q1, [x2], 16
        str     q0, [x1], -16
        cmp     x2, x5
        bne     .L4

I am not shocked that IV-OPTS can make such wildly different choices. I have
not looked at the cost differences to understand why -mcpu=thunderx2t99 chose
what might be close to the best (we could use one less IV by having the first
ldr use the same IV as the last str).
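
For readers comparing the listings, the two addressing styles can be mimicked
at source level. The following is only a scalar sketch, not the test case from
this bug report: the function names are hypothetical, and the compiler's
IV-OPTS pass is free to rewrite either form into the other. The first function
keeps one base pointer plus two offsets (roughly the indexed [base, offset]
shape), the second walks two pointers toward each other (roughly the
post-increment/post-decrement shape).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration: reverse a byte buffer in place using one base
 * pointer and two index induction variables (lo counts up, hi counts down). */
void reverse_bytes_indexed(uint8_t *buf, size_t n)
{
    if (n < 2)
        return;
    for (size_t lo = 0, hi = n - 1; lo < hi; lo++, hi--) {
        uint8_t tmp = buf[lo];
        buf[lo] = buf[hi];
        buf[hi] = tmp;
    }
}

/* Hypothetical illustration: the same reversal using two pointer induction
 * variables, with the loop exit test comparing the pointers directly. */
void reverse_bytes_pointers(uint8_t *buf, size_t n)
{
    if (n < 2)
        return;
    uint8_t *lo = buf;
    uint8_t *hi = buf + n - 1;
    while (lo < hi) {
        uint8_t tmp = *lo;
        *lo++ = *hi;   /* store to the low end, then advance it */
        *hi-- = tmp;   /* store to the high end, then retreat it */
    }
}
```

Both functions produce the same result; which induction variables end up in
the generated loop depends on the target's cost model, as the listings above
show.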