https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596

--- Comment #14 from Mateusz Guzik <mjguzik at gmail dot com> ---
So I reran the bench on AMD EPYC 9R14 and also experienced a win.

To recap: gcc emits rep movsq/stosq for sizes > 40. I'm replacing that with
unrolled loops for sizes up to 256 and punting to actual function calls past
that.
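
For illustration only, the size-based strategy above can be sketched in plain
C. This is not the patch under discussion and not what gcc expands inline: the
helper name copy_small_or_call and the INLINE_COPY_MAX macro are hypothetical,
the 4x unroll factor is an arbitrary choice, and an optimizing compiler is free
to turn these loops back into something else. It only shows the control flow:
unrolled 8-byte moves up to the cutoff, a library call past it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff, taken from the "up to 256" figure in the comment. */
#define INLINE_COPY_MAX 256

/*
 * Hypothetical helper: copy n bytes from src to dst (non-overlapping).
 * Small sizes use an unrolled loop of 8-byte moves; larger sizes are
 * handed to memcpy().
 */
void *copy_small_or_call(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n > INLINE_COPY_MAX)
        return memcpy(dst, src, n);     /* punt to the real function */

    /* Unrolled main loop: four 8-byte loads, then four 8-byte stores. */
    while (n >= 32) {
        uint64_t a, b, c, e;
        memcpy(&a, s, 8);
        memcpy(&b, s + 8, 8);
        memcpy(&c, s + 16, 8);
        memcpy(&e, s + 24, 8);
        memcpy(d, &a, 8);
        memcpy(d + 8, &b, 8);
        memcpy(d + 16, &c, 8);
        memcpy(d + 24, &e, 8);
        s += 32;
        d += 32;
        n -= 32;
    }

    /* Remaining whole 8-byte words. */
    while (n >= 8) {
        uint64_t a;
        memcpy(&a, s, 8);
        memcpy(d, &a, 8);
        s += 8;
        d += 8;
        n -= 8;
    }

    /* Byte tail. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

The fixed-size memcpy() calls are the usual way to write unaligned 8-byte
loads and stores in portable C; compilers normally lower each one to a single
move instruction.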

All tests on fresh Linux master (a2cc6ff5ec8f91bc463fd3b0c26b61166a07eb11).

fstat() rate went from ~2.41 million to ~2.51 million, about 4%:

before:
min:2412348 max:2412348 total:2412348
min:2412025 max:2412025 total:2412025
min:2411442 max:2411442 total:2411442

after:
min:2506723 max:2506723 total:2506723
min:2508430 max:2508430 total:2508430
min:2510306 max:2510306 total:2510306

The "hello world" build also got faster, by roughly 0.8%.

Total builds during test period, excluding warmup:
before: 8069
after: 8136
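
As a cross-check of the arithmetic, here is a small C sketch that sums
per-run counts and computes the relative change. The function names
sum_counts and percent_change are made up for this example; the inputs are
the "bench:" figures from the full results at the end of this comment.

```c
#include <stddef.h>

/* Add up per-run operation counts. */
unsigned long sum_counts(const unsigned long *v, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Relative change from `before` to `after`, in percent. */
double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

Feeding it the ten "bench:" counts from each set gives 8069 and 8136, and
percent_change(8069, 8136) comes out at about 0.83. The same function on the
averaged fstat() rates gives about 4.0.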

Full results at the end.

Note that while running fstat() in a loop is very microbenchmark-ey, spawning
the compiler to do something is not.

Finally, I don't claim an unrolled loop is the fastest thing to do for that
specific uarch. I am claiming that the old uarchs suffer a lot from rep
movsq/stosq usage at these sizes, and it turns out this is also a problem for
the new ones.

Also note this provided a win despite increased i-cache footprint.

Do you guys need results from *old* archs? Because things sucking for those is
rather well established I think.

Full results of the hello world build:

before:
warmup: 403 ops (80 ops/s)
bench: 806 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.75s user 6.22s system 99% cpu 15.01s
(15.007) total
warmup: 404 ops (80 ops/s)
bench: 806 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.71s user 6.26s system 99% cpu 15.01s
(15.013) total
warmup: 404 ops (80 ops/s)
bench: 807 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.75s user 6.22s system 99% cpu 15.01s
(15.008) total
warmup: 404 ops (80 ops/s)
bench: 807 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.72s user 6.25s system 99% cpu 15.02s
(15.019) total
warmup: 404 ops (80 ops/s)
bench: 807 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.74s user 6.24s system 99% cpu 15.02s
(15.016) total
warmup: 404 ops (80 ops/s)
bench: 807 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.83s user 6.14s system 99% cpu 15.01s
(15.006) total
warmup: 404 ops (80 ops/s)
bench: 807 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.72s user 6.25s system 99% cpu 15.01s
(15.010) total
warmup: 404 ops (80 ops/s)
bench: 807 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.75s user 6.23s system 99% cpu 15.02s
(15.025) total
warmup: 404 ops (80 ops/s)
bench: 807 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.71s user 6.26s system 99% cpu 15.01s
(15.008) total
warmup: 403 ops (80 ops/s)
bench: 808 ops (80 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.69s user 6.28s system 99% cpu 15.01s
(15.011) total

after:
warmup: 408 ops (81 ops/s)
bench: 814 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.85s user 6.12s system 99% cpu 15.01s
(15.010) total
warmup: 407 ops (81 ops/s)
bench: 813 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.83s user 6.13s system 99% cpu 15.01s
(15.009) total
warmup: 407 ops (81 ops/s)
bench: 815 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.91s user 6.07s system 99% cpu 15.01s
(15.014) total
warmup: 407 ops (81 ops/s)
bench: 812 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.81s user 6.15s system 99% cpu 15.01s
(15.009) total
warmup: 408 ops (81 ops/s)
bench: 813 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.81s user 6.15s system 99% cpu 15.01s
(15.011) total
warmup: 407 ops (81 ops/s)
bench: 813 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.86s user 6.12s system 99% cpu 15.02s
(15.024) total
warmup: 406 ops (81 ops/s)
bench: 814 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.87s user 6.10s system 99% cpu 15.01s
(15.013) total
warmup: 408 ops (81 ops/s)
bench: 813 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.86s user 6.12s system 99% cpu 15.02s
(15.021) total
warmup: 408 ops (81 ops/s)
bench: 814 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.82s user 6.15s system 99% cpu 15.02s
(15.017) total
warmup: 409 ops (81 ops/s)
bench: 815 ops (81 ops/s)
taskset --cpu-list 1 ./ccbench 10  8.83s user 6.14s system 99% cpu 15.02s
(15.020) total
