https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596

--- Comment #21 from Mateusz Guzik <mjguzik at gmail dot com> ---
Given the issues outlined in 119703 and 119704, I decided to microbenchmark 2
older uarchs with select sizes. Note that a better-quality test, one which does
not merely microbenchmark memset or memcpy, is above for one reasonably recent
AMD and one reasonably recent Intel uarch.

I found that rep stosq is still highly penalized on both uarchs, while rep
movsq suffers less on AMD, but it is still faster to use regular stores at least
up to 256 bytes. I fully concede that this equation may change on a very new AMD
arch, but that's not the case on Intel (new or old).

Borrowing from contrib/bench-stringop, I'm rolling over a 16M buffer. The
figures below are will-it-scale iteration counts, so higher is better.

memcpy 256 bytes:

AMD Ryzen Threadripper 2990WX

testcase:256_rep8
min:27762317 max:27762317 total:27762317
min:27739493 max:27739493 total:27739493
min:27727869 max:27727869 total:27727869

testcase:256_unrolled
min:28374940 max:28374940 total:28374940
min:28371060 max:28371060 total:28371060
min:28358297 max:28358297 total:28358297

Haswell:

testcase:256_rep8
min:14209786 max:14209786 total:14209786
min:14192041 max:14192041 total:14192041
min:14282288 max:14282288 total:14282288

testcase:256_unrolled
min:57857624 max:57857624 total:57857624
min:58826876 max:58826876 total:58826876
min:57539739 max:57539739 total:57539739

==============

memset 256 bytes:

AMD Ryzen Threadripper 2990WX

testcase:256_rep8
min:32776195 max:32776195 total:32776195
min:32784246 max:32784246 total:32784246
min:32838932 max:32838932 total:32838932

testcase:256_unrolled
min:34131140 max:34131140 total:34131140
min:34088875 max:34088875 total:34088875
min:34076293 max:34076293 total:34076293

Haswell:

testcase:256_rep8
min:24953563 max:24953563 total:24953563
min:24905210 max:24905210 total:24905210
min:24877085 max:24877085 total:24877085

testcase:256_unrolled
min:58712755 max:58712755 total:58712755
min:58853415 max:58853415 total:58853415
min:58626856 max:58626856 total:58626856

==============

memset 56 bytes:

AMD Ryzen Threadripper 2990WX

testcase:56_rep8
min:115632478 max:115632478 total:115632478
min:115848126 max:115848126 total:115848126
min:115762251 max:115762251 total:115762251

testcase:56_unrolled
min:152329392 max:152329392 total:152329392
min:152526437 max:152526437 total:152526437
min:152496941 max:152496941 total:152496941

Repro instructions:
https://people.freebsd.org/~mjg/.junk/will-it-scale.tgz

sh compile-copy-rolling.sh
sh compile-zero-rolling.sh

then e.g., ./copy_256_unrolled

I can't stress enough that this is a little bit naive, but it should be good
enough here. A non-naive test was done in previous comments, where the kernel
was recompiled and the benchmark consisted of actually doing something.

I am not in a position to do the same thing for these 2 older archs, hence the
other test.

Hopefully this is enough to adjust the default for -mno-sse: don't use rep for
sizes 256 or lower.
