https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119960
--- Comment #1 from Arseny Kapoulkine <arseny.kapoulkine at gmail dot com> ---
Here's what I think is happening:

- The hot path of the C code, on every iteration, reads a pair of integers from `edgefifo`, decodes a triangle, and writes two new pairs of integers to `edgefifo`. It's very common for an edge to be written on one iteration and read on the next.
- In gcc 14, the read is done twice: once into a 64-bit GPR (rbp), and once into two separate XMM registers (xmm0/xmm1). Each of the two writes uses a 64-bit movq from the XMM registers.
- In gcc 15, the read is done once, as a 64-bit value into an XMM register (xmm0). Both writes are split and done one pair element at a time (both from XMM and from GPR).

I assume this breaks store-to-load forwarding, or makes it more expensive. For code generated by gcc 14, the CPU only needs to know how to forward a 64-bit store to loads of its separate 32-bit halves; for code generated by gcc 15, the CPU would need to know how to combine two 32-bit stores into a single 64-bit load, and Zen 4 (like many other CPUs) can't do that.

AMD uProf shows ~47M successful store->load forwards on the gcc 15 binary and ~151M on the gcc 14 binary.
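To make the two store patterns concrete, here is a minimal C sketch. It is not the code from the report: `Edge`, `FIFO_SIZE`, `push_wide`, `push_split` and `read_wide` are hypothetical names chosen for illustration, and `volatile` is used only to discourage the compiler from merging the two 4-byte stores. The actual instruction selection still depends on the compiler and flags, so check the disassembly before drawing conclusions from timing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FIFO_SIZE 16u /* hypothetical size, must be a power of two */

/* Hypothetical stand-in for one edge FIFO entry: a pair of 32-bit indices. */
typedef struct { uint32_t a, b; } Edge;

_Static_assert(sizeof(Edge) == 8, "sketch assumes an 8-byte entry");

/* Pattern resembling the gcc 14 output: the entry is written with a
 * single 8-byte copy, which compilers typically lower to one 64-bit store. */
void push_wide(Edge *fifo, unsigned *offset, uint32_t a, uint32_t b)
{
    assert(*offset < FIFO_SIZE);
    Edge e = { a, b };
    memcpy(&fifo[*offset], &e, sizeof e);
    *offset = (*offset + 1) & (FIFO_SIZE - 1);
}

/* Pattern resembling the gcc 15 output: the entry is written as two
 * separate 4-byte stores. The volatile-qualified pointer keeps the
 * compiler from combining them. */
void push_split(Edge *fifo, unsigned *offset, uint32_t a, uint32_t b)
{
    assert(*offset < FIFO_SIZE);
    volatile Edge *slot = &fifo[*offset];
    slot->a = a;
    slot->b = b;
    *offset = (*offset + 1) & (FIFO_SIZE - 1);
}

/* Read the whole entry as one 8-byte value. If this load follows
 * push_split on the same slot, it spans two smaller in-flight stores. */
uint64_t read_wide(const Edge *fifo, unsigned index)
{
    assert(index < FIFO_SIZE);
    uint64_t bits;
    memcpy(&bits, &fifo[index], sizeof bits);
    return bits;
}
```

Both push functions leave the same bytes in memory; the difference is only in how many stores cover the 8 bytes that a following `read_wide` loads. Calling `push_split` and then `read_wide` on the same slot in a tight loop is the shape that, per the analysis above, is suspected of defeating store-to-load forwarding.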