https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244

--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Andrew Pinski from comment #3)
> (In reply to Peter Cordes from comment #1)
> > On AArch64 (with gcc8.2), we see a similar effect, more instructions in the
> > loop.  And an indexed addressing mode.

That was an overstatement; the generic tuning I showed isn't using 2 separate
pointers or indices like we get on x86.

Your thunderx2t99 output is like that, but write-back addressing modes mean it
doesn't cost extra instructions.

> I am not shocked that IV-OPTS can make such wildly different choices.
> I have not looked at the cost differences to understand why
> -mcpu=thunderx2t99 chose what might be close to the best (we could use one less
> IV by replacing the first ldr by using the same IV as the last str).

I don't know ARM tuning, but the x86 version is clearly worse: it has an extra
uop inside the loop, plus an extra instruction to copy the register before the
loop, which wastes code-size if nothing else.

On Skylake, for example, the loop is 10 uops and bottlenecks on front-end
throughput (4 uops / clock), provided the back-end can keep up with a bit less
than 1 store per clock.  (Easy if pointers are aligned and data is hot in L1d.)
Reducing it to 9 uops should help in practice.  Getting it down to 8 uops would
be really nice, but we can't do that unless we can use a shuffle that
micro-fuses with a load.  (For int elements, AVX2 VPERMD can micro-fuse a
memory source, and so can SSE2 PSHUFD.  pshufb's xmm/memory operand is the
control vector, which doesn't help us.  AVX512 vpermb can't micro-fuse.)
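
As a rough check on those uop counts, here is a minimal sketch in C of the
front-end-only bound: uops per iteration divided by issue width.  The function
name is hypothetical, and the model assumes nothing other than issue width
limits the loop (no store, cache, or dependency-chain bottlenecks).

```c
#include <assert.h>

/* Hypothetical helper: lower bound on cycles per loop iteration when the
   only limit is how many uops the front-end can issue per clock.
   Assumption: every other resource keeps up, so this is a best case. */
static double frontend_cycles_per_iter(int uops_per_iter, int issue_width)
{
    assert(uops_per_iter > 0);
    assert(issue_width > 0);
    return (double)uops_per_iter / (double)issue_width;
}
```

With an issue width of 4, this gives 2.5 cycles per iteration for 10 uops,
2.25 for 9, and 2.0 for 8, which is why each uop removed from the loop matters
while the front-end is the bottleneck.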
