https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244
--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Andrew Pinski from comment #3)
> (In reply to Peter Cordes from comment #1)
> > On AArch64 (with gcc8.2), we see a similar effect, more instructions in the
> > loop. And an indexed addressing mode.

That was an overstatement; the generic tuning I showed isn't using 2 separate
pointers or indices like we get on x86.

Your thunderx2t99 output is like that, but write-back addressing modes mean it
doesn't cost extra instructions.

> I am not shocked that IV-OPTS can chose these widly differences.
> I have not looked at the cost differences to understand why
> -mcpu=thunderx2t99 chose what close might be the best (we could use one less
> IV by replacing the first ldr by using the same IV as the last str).

I don't know ARM tuning; the x86 version is clearly worse, with an extra uop
inside the loop. And an extra instruction to copy the register before the
loop, wasting code-size if nothing else.

On Skylake for example, the loop is 10 uops and bottlenecks on front-end
throughput (4 uops / clock) if the back-end can keep up with a bit less than 1
store per clock. (Easy if pointers are aligned and data is hot in L1d.)

Reducing it to 9 uops should help in practice. Getting it down to 8 uops would
be really nice, but we can't do that unless we could use a shuffle that
micro-fuses with a load.

(For int elements, AVX2 VPERMD can micro-fuse a memory source, and so can SSE2
PSHUFD. pshufb's xmm/memory operand is the control vector, which doesn't help
us. AVX512 vpermb can't micro-fuse.)