https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #14 from Mateusz Guzik <mjguzik at gmail dot com> ---

So I reran the bench on AMD EPYC 9R14 and also experienced a win.

To recap, gcc emits rep movsq/stosq for sizes > 40. I'm replacing that with unrolled loops for sizes up to 256 and punting to actual funcs past that.

All tests on fresh Linux master (a2cc6ff5ec8f91bc463fd3b0c26b61166a07eb11).

fstat() rate went from ~2.4 million to ~2.5 million:

before:
min:2412348 max:2412348 total:2412348
min:2412025 max:2412025 total:2412025
min:2411442 max:2411442 total:2411442

after:
min:2506723 max:2506723 total:2506723
min:2508430 max:2508430 total:2508430
min:2510306 max:2510306 total:2510306

The "hello world" build also got faster, by about 1%. Total builds during test period, excluding warmup:

before: 8069
after: 8136

Full results at the end.

Note that while running fstat() in a loop is very microbenchmark-ey, spawning the compiler to do something is not.

Finally, I don't claim an unrolled loop is the fastest thing to do for that specific uarch. I am claiming the old uarchs suffer a lot from rep movsq/stosq usage for these sizes, and it turns out this is also a problem for the new ones. Also note this provided a win despite the increased i-cache footprint.

Do you guys need results from *old* archs? Because things sucking for those is rather well established I think.
full results of the hello world build:

before:
warmup: 403 ops (80 ops/s)
bench: 806 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.75s user 6.22s system 99% cpu 15.01s (15.007) total
warmup: 404 ops (80 ops/s)
bench: 806 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.71s user 6.26s system 99% cpu 15.01s (15.013) total
warmup: 404 ops (80 ops/s)
bench: 807 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.75s user 6.22s system 99% cpu 15.01s (15.008) total
warmup: 404 ops (80 ops/s)
bench: 807 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.72s user 6.25s system 99% cpu 15.02s (15.019) total
warmup: 404 ops (80 ops/s)
bench: 807 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.74s user 6.24s system 99% cpu 15.02s (15.016) total
warmup: 404 ops (80 ops/s)
bench: 807 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.83s user 6.14s system 99% cpu 15.01s (15.006) total
warmup: 404 ops (80 ops/s)
bench: 807 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.72s user 6.25s system 99% cpu 15.01s (15.010) total
warmup: 404 ops (80 ops/s)
bench: 807 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.75s user 6.23s system 99% cpu 15.02s (15.025) total
warmup: 404 ops (80 ops/s)
bench: 807 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.71s user 6.26s system 99% cpu 15.01s (15.008) total
warmup: 403 ops (80 ops/s)
bench: 808 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.69s user 6.28s system 99% cpu 15.01s (15.011) total

after:
warmup: 408 ops (81 ops/s)
bench: 814 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.85s user 6.12s system 99% cpu 15.01s (15.010) total
warmup: 407 ops (81 ops/s)
bench: 813 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.83s user 6.13s system 99% cpu 15.01s (15.009) total
warmup: 407 ops (81 ops/s)
bench: 815 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.91s user 6.07s system 99% cpu 15.01s (15.014) total
warmup: 407 ops (81 ops/s)
bench: 812 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.81s user 6.15s system 99% cpu 15.01s (15.009) total
warmup: 408 ops (81 ops/s)
bench: 813 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.81s user 6.15s system 99% cpu 15.01s (15.011) total
warmup: 407 ops (81 ops/s)
bench: 813 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.86s user 6.12s system 99% cpu 15.02s (15.024) total
warmup: 406 ops (81 ops/s)
bench: 814 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.87s user 6.10s system 99% cpu 15.01s (15.013) total
warmup: 408 ops (81 ops/s)
bench: 813 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.86s user 6.12s system 99% cpu 15.02s (15.021) total
warmup: 408 ops (81 ops/s)
bench: 814 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.82s user 6.15s system 99% cpu 15.02s (15.017) total
warmup: 409 ops (81 ops/s)
bench: 815 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10 8.83s user 6.14s system 99% cpu 15.02s (15.020) total