https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #21 from Mateusz Guzik <mjguzik at gmail dot com> ---
Given the issues outlined in 119703 and 119704, I decided to microbenchmark 2 older uarchs with select sizes.

Note that a better-quality test, which does not merely microbenchmark memset or memcpy, is above for one reasonably recent AMD and one reasonably recent Intel uarch.

I found that rep stosq is still highly penalized on both variants, while rep movsq suffers less on AMD, but it is still faster to use regular stores at least up to 256 bytes.

I fully concede that this equation may change on a very new AMD arch, but that is not true on Intel (new or old).

Borrowing from contrib/bench-stringop, I'm rolling over a 16M buffer.

memcpy 256 bytes:

AMD Ryzen Threadripper 2990WX
testcase:256_rep8
min:27762317 max:27762317 total:27762317
min:27739493 max:27739493 total:27739493
min:27727869 max:27727869 total:27727869

testcase:256_unrolled
min:28374940 max:28374940 total:28374940
min:28371060 max:28371060 total:28371060
min:28358297 max:28358297 total:28358297

Haswell:
testcase:256_rep8
min:14209786 max:14209786 total:14209786
min:14192041 max:14192041 total:14192041
min:14282288 max:14282288 total:14282288

testcase:256_unrolled
min:57857624 max:57857624 total:57857624
min:58826876 max:58826876 total:58826876
min:57539739 max:57539739 total:57539739

==============

memset 256 bytes:

AMD Ryzen Threadripper 2990WX
testcase:256_rep8
min:32776195 max:32776195 total:32776195
min:32784246 max:32784246 total:32784246
min:32838932 max:32838932 total:32838932

testcase:256_unrolled
min:34131140 max:34131140 total:34131140
min:34088875 max:34088875 total:34088875
min:34076293 max:34076293 total:34076293

Haswell:
testcase:256_rep8
min:24953563 max:24953563 total:24953563
min:24905210 max:24905210 total:24905210
min:24877085 max:24877085 total:24877085

testcase:256_unrolled
min:58712755 max:58712755 total:58712755
min:58853415 max:58853415 total:58853415
min:58626856 max:58626856 total:58626856

==============

memset 56 bytes:

AMD Ryzen Threadripper 2990WX
testcase:56_rep8
min:115632478 max:115632478 total:115632478
min:115848126 max:115848126 total:115848126
min:115762251 max:115762251 total:115762251

testcase:56_unrolled
min:152329392 max:152329392 total:152329392
min:152526437 max:152526437 total:152526437
min:152496941 max:152496941 total:152496941

Repro instructions:
https://people.freebsd.org/~mjg/.junk/will-it-scale.tgz

sh compile-copy-rolling.sh
sh compile-zero-rolling.sh

then e.g., ./copy_256_unrolled

I can't stress enough that this is a little bit naive, but it should be good enough here. A non-naive test was done in previous comments, where the kernel was recompiled and the benchmark consisted of actually doing something. I am not in a position to do the same thing for these 2 older archs, hence the other test.

Hopefully this is enough to augment the default for -mno-sse: don't rep for sizes 256 or lower.
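
For anyone who wants the numbers above as ratios rather than raw counts, here is a small sketch that averages the three samples per testcase and divides unrolled by rep8. It is not part of the will-it-scale tarball; the RESULTS table and the unrolled_over_rep8 helper are names made up for this illustration, and it assumes the reported counts are iterations completed per sample, so higher is better.

```python
# Samples copied from the results in this comment:
# (operation, size in bytes, CPU) -> variant -> three reported counts.
# Assumption: each count is iterations completed per sample (higher is better).
RESULTS = {
    ("memcpy", 256, "Threadripper 2990WX"): {
        "rep8": [27762317, 27739493, 27727869],
        "unrolled": [28374940, 28371060, 28358297],
    },
    ("memcpy", 256, "Haswell"): {
        "rep8": [14209786, 14192041, 14282288],
        "unrolled": [57857624, 58826876, 57539739],
    },
    ("memset", 256, "Threadripper 2990WX"): {
        "rep8": [32776195, 32784246, 32838932],
        "unrolled": [34131140, 34088875, 34076293],
    },
    ("memset", 256, "Haswell"): {
        "rep8": [24953563, 24905210, 24877085],
        "unrolled": [58712755, 58853415, 58626856],
    },
    ("memset", 56, "Threadripper 2990WX"): {
        "rep8": [115632478, 115848126, 115762251],
        "unrolled": [152329392, 152526437, 152496941],
    },
}


def unrolled_over_rep8(samples):
    """Return mean(unrolled) / mean(rep8); a value above 1.0 favours unrolled stores."""
    mean_unrolled = sum(samples["unrolled"]) / len(samples["unrolled"])
    mean_rep8 = sum(samples["rep8"]) / len(samples["rep8"])
    return mean_unrolled / mean_rep8


if __name__ == "__main__":
    for (op, size, cpu), samples in RESULTS.items():
        ratio = unrolled_over_rep8(samples)
        print(f"{op} {size:>3} bytes  {cpu:<20} unrolled/rep8 = {ratio:.2f}")
```

Under that assumption, the unrolled variant comes out roughly 4x ahead for the 256-byte memcpy on Haswell and a few percent ahead on the 2990WX at 256 bytes, with a larger gap for the 56-byte memset.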