https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #8 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Note that the vectorized variant is latency-bound: the vector load in loop()
waits for the vector store to the same location made by the previous invocation
of loop(). This makes the microbenchmark take 10 cycles per iteration (9 cycles
of vector store-forwarding latency, plus 1 cycle for the ALU op).

In contrast, the fully scalar variant benefits from "memory renaming" in Zen 2
and Zen 4 (absent in Zen 3), where store forwarding happens earlier in the
pipeline with zero-cycle latency. I think the scalar variant then bottlenecks
on taken branches.
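
To make the dependency chain concrete, here is a minimal sketch of the kind of
microbenchmark being discussed. It is not the testcase from the bug report: the
array 'a', its size, the body of loop() and the driver run() are all
assumptions made for illustration. The point is only that each call reads what
the previous call wrote, so the calls are serialized through memory.

```c
/* Hypothetical stand-in for the testcase; NOT the code from the bug report. */
int a[4];

/* noinline keeps each call separate, so every invocation has to reload
   the values stored by the previous one. */
__attribute__((noinline))
void loop(void)
{
    /* If SLP vectorization kicks in, these four updates may become one
       vector load, one vector add and one vector store. The load then
       waits on store forwarding from the previous call's vector store. */
    a[0] += 1;
    a[1] += 1;
    a[2] += 1;
    a[3] += 1;
}

/* Hypothetical driver: n back-to-back calls form one long dependency
   chain through the memory backing 'a'. */
void run(long n)
{
    for (long i = 0; i < n; i++)
        loop();
}
```

Timing run() with a large n, once built with SLP vectorization and once
without, should show whether the store-forwarding latency or the scalar path
dominates on a given core.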
