https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690
Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #8 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Note that the vectorized variant is latency-bound: the vector load in loop()
waits for the vector store to the same location done by the previous
invocation of loop(). This makes the microbenchmark take 10 cycles per
iteration (9 cycles of vector store-forwarding latency, plus 1 cycle for the
ALU op).

In contrast, the fully-scalar variant benefits from "memory renaming" in Zen 2
and Zen 4 (absent in Zen 3), where store-forwarding happens earlier in the
pipeline with zero-cycle latency. I think the scalar variant then bottlenecks
on taken branches.
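
To make the cycle arithmetic above concrete, here is a minimal sketch in C. It
is not the testcase from this bug: the array `counters`, the body of `loop()`,
and the helpers `run()` and `chain_cycles()` are hypothetical names chosen for
illustration, and `__attribute__((noinline))` assumes GCC or Clang. The only
figures taken from the comment are the 9-cycle forwarding latency and the
1-cycle ALU op.

```c
#include <stdint.h>

/* Hypothetical stand-in for the benchmark data; not the bug's testcase. */
int32_t counters[4];

/* Each call loads what the previous call stored, adds to it, and stores it
 * back, so consecutive calls form a load -> ALU -> store chain through
 * memory. Whether the compiler emits one vector op or four scalar ops here
 * depends on the target and flags. */
__attribute__((noinline))
void loop(void)
{
    for (int i = 0; i < 4; i++)
        counters[i] += 1;
}

/* Call loop() back to back, as a microbenchmark driver would. */
void run(unsigned long calls)
{
    for (unsigned long n = 0; n < calls; n++)
        loop();
}

/* Length in cycles of one link of the dependency chain: the time for the
 * store to be forwarded to the next load, plus the ALU op on the loaded
 * value. This is a back-of-the-envelope model, not a simulator. */
unsigned chain_cycles(unsigned forward_latency, unsigned alu_latency)
{
    return forward_latency + alu_latency;
}
```

With the figures from the comment, `chain_cycles(9, 1)` gives the 10 cycles per
iteration quoted for the vectorized variant. With zero-cycle forwarding the
same model gives `chain_cycles(0, 1)`, i.e. 1 cycle, at which point the memory
dependency chain is presumably no longer the limiting factor, which fits the
suggestion that the scalar variant is limited by taken branches instead.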