https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67072
--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
I just timed with Linux perf:

    time taskset 0x04 perf stat -e task-clock,cycles,instructions,r1b1,r10e,r2c2,r1c2,stalled-cycles-frontend,stalled-cycles-backend ./rs-asmbench

My code averages 3.57 fused-domain uops / cycle (3x 1000 iters over a 1MiB buffer).
gcc's code averages 3.10 fused-domain uops / cycle (3x 1000 iters over a 1MiB buffer).

So it's not just the extra mov uops slowing things down. Either gcc's code isn't scheduled as well, or the extra mov uops are taking up execution units and preventing the CPU from running enough load/store uops to go beyond 3 uops per cycle.
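
For reference, a minimal sketch of the arithmetic behind those figures. It assumes r10e counts issued fused-domain uops (UOPS_ISSUED.ANY on Intel Sandybridge-family CPUs, which the comment does not state) and that the rate is simply that count divided by cycles. The function names are hypothetical, and the only numbers used are the two averages reported above.

```python
def uops_per_cycle(uops_issued: int, cycles: int) -> float:
    """Fused-domain uops per cycle from two raw perf stat counts.

    Assumption: uops_issued is the r10e count and cycles is the
    'cycles' count from the same perf stat run.
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return uops_issued / cycles


def relative_gain(rate_a: float, rate_b: float) -> float:
    """Fractional throughput advantage of rate_a over rate_b."""
    if rate_b <= 0:
        raise ValueError("rate_b must be positive")
    return rate_a / rate_b - 1.0


# Averages reported in the comment above (uops / cycle).
HAND_WRITTEN_RATE = 3.57
GCC_RATE = 3.10

# Fraction by which the hand-written loop out-issues gcc's loop.
gain = relative_gain(HAND_WRITTEN_RATE, GCC_RATE)
print(f"hand-written asm issues {gain:.1%} more uops per cycle")
```

If gcc's loop were slower only because it executes more uops at the same rate, the two uops / cycle figures would match. The gap in rate is what points at scheduling or execution-port pressure.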