https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119960

--- Comment #1 from Arseny Kapoulkine <arseny.kapoulkine at gmail dot com> ---
Here's what I think is happening:

- The C code hot path is, on every iteration, reading a pair of integers from
`edgefifo`, decoding a triangle and writing two new pairs of integers to
`edgefifo`. It's very common for the edge to be written on one iteration and
read on the next iteration.

- In gcc 14, the read is done twice: once into a 64-bit GPR (rbp), and once
into two separate XMM registers (xmm0/xmm1). Each of the two writes uses a
64-bit movq from the XMM registers.

- In gcc 15, the read is done once as a 64-bit value into an XMM register
(xmm0). Both writes are split and done one pair element at a time (from both
XMM registers and GPRs).
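
For reference, the access pattern from the first point can be sketched roughly
as below. This is a simplified illustration and not the actual decoder: only
`edgefifo` is a name from the code in question, while `FIFO_SIZE`, `push_edge`
and `decode_step` are hypothetical and the FIFO size is an assumption.

```c
#include <stddef.h>

#define FIFO_SIZE 16 /* assumed power-of-two size, for illustration only */

typedef unsigned int EdgeFifo[FIFO_SIZE][2];

/* Write one edge (a pair of 32-bit indices) and advance the write offset. */
static void push_edge(EdgeFifo fifo, size_t *offset, unsigned int a, unsigned int b)
{
    fifo[*offset][0] = a;
    fifo[*offset][1] = b;
    *offset = (*offset + 1) & (FIFO_SIZE - 1);
}

/* One iteration: read the edge written `back` pushes ago (0 = most recent),
   form a triangle with vertex c, then write the two new edges. */
static void decode_step(EdgeFifo fifo, size_t *offset, size_t back,
                        unsigned int c, unsigned int tri[3])
{
    size_t slot = (*offset - 1 - back) & (FIFO_SIZE - 1);
    unsigned int a = fifo[slot][0]; /* pair read: this is the load that */
    unsigned int b = fifo[slot][1]; /* depends on a very recent store   */

    tri[0] = a;
    tri[1] = b;
    tri[2] = c;

    push_edge(fifo, offset, c, b);
    push_edge(fifo, offset, a, c);
}
```

With `back` equal to 0 or 1, the pair read hits a slot written by the previous
call, which is the case where store-to-load forwarding matters.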

I assume this breaks store-to-load forwarding, or makes it more expensive: for
code generated by gcc 14, the CPU only needs to know how to forward a 64-bit
store to separate 32-bit loads of its halves; for code generated by gcc 15, the
CPU would need to know how to combine two 32-bit stores into a single 64-bit
load, and Zen 4 (like many other CPUs) can't do that.
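
The two store/load shapes contrasted above can be isolated in a small sketch
like the following. The `Slot` union and both function names are hypothetical;
`volatile` is used only to keep the compiler from merging or eliding the
accesses. Whether timing these in a loop reproduces the gap is an assumption
that depends on the CPU.

```c
#include <stdint.h>

/* One FIFO entry viewed either as a 64-bit value or as two 32-bit halves. */
typedef union {
    uint64_t whole;
    uint32_t half[2];
} Slot;

/* gcc 14-like shape: one 64-bit store, then 32-bit loads of its halves.
   Each load is fully contained in a single earlier store. */
static uint32_t wide_store_narrow_loads(volatile Slot *s, uint32_t a, uint32_t b)
{
    s->whole = ((uint64_t)b << 32) | a;
    return s->half[0] + s->half[1];
}

/* gcc 15-like shape: two 32-bit stores, then one 64-bit load spanning both.
   The load needs data from two separate stores. */
static uint64_t narrow_stores_wide_load(volatile Slot *s, uint32_t a, uint32_t b)
{
    s->half[0] = a;
    s->half[1] = b;
    return s->whole;
}
```

Both functions move the same bytes; only the widths of the stores and loads
differ, which is the property the forwarding hypothesis rests on.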

AMD uProf shows ~47M successful store-to-load forwards for the gcc 15 binary
and ~151M for the gcc 14 binary.
