https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #13 from Mateusz Guzik <mjguzik at gmail dot com> ---

I see there is a significant disconnect between what I meant with this problem report and your perspective, so I'm going to be more explicit.

Of course for best performance on a given uarch you would want to -mtune for that uarch, but that's not the goal here. Rather, with the Linux kernel as an example, assume the code has to be compiled with a generic x86_64 chip in mind. My claim is that, in that setting, the asm emitted for small inline memcpy and memset calls loses on performance.

The last time I had a serious look at string ops optimization was around 2018 or 2019, and at that time all CPUs (AMD included) struggled with rep stosq/movsq on short ops. It makes sense to request benchmarks from other CPUs today. To that end, I'm asking what kind of standard is expected here in terms of tests to run.

As for AMD uarchs, I can get my hands on two: EPYC Genoa (4th gen) and EPYC 7571.
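To make the question about tests concrete, here is a minimal sketch of the kind of microbenchmark that could be compared across uarchs. It is only an illustration: the helper names (zero_rep_stosq, zero_plain_stores, ns_per_call), the 64-qword buffer and the timing loop are assumptions made for this example, not code taken from GCC or the kernel. It relies on GCC/Clang extended asm and falls back to memset on non-x86_64 targets.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: zero `qwords` 8-byte words with rep stosq. */
static inline void zero_rep_stosq(void *dst, size_t qwords)
{
#if defined(__x86_64__)
    /* RDI = destination, RCX = count, RAX = value to store. */
    __asm__ volatile("rep stosq"
                     : "+D"(dst), "+c"(qwords)
                     : "a"((uint64_t)0)
                     : "memory");
#else
    memset(dst, 0, qwords * 8); /* fallback so the sketch builds elsewhere */
#endif
}

/* Hypothetical helper: the same work written as plain 8-byte stores.
 * The compiler is free to unroll or vectorize this loop. */
static inline void zero_plain_stores(void *dst, size_t qwords)
{
    uint64_t *p = dst;
    for (size_t i = 0; i < qwords; i++)
        p[i] = 0;
    /* Compiler barrier so the stores are not optimized away. */
    __asm__ volatile("" : : "r"(p) : "memory");
}

/* Average wall-clock nanoseconds per call of `fn` on a small static
 * buffer. Returns -1.0 if `qwords` exceeds the 64-qword buffer. */
static double ns_per_call(void (*fn)(void *, size_t), size_t qwords, long iters)
{
    static uint64_t buf[64];
    struct timespec t0, t1;

    if (qwords > 64 || iters <= 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        fn(buf, qwords);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters;
}
```

Calling ns_per_call(zero_rep_stosq, n, iters) and ns_per_call(zero_plain_stores, n, iters) for a range of small n would give a first-order comparison per machine. A wall-clock loop like this is crude (the indirect call and loop overhead are included in both numbers), so it is a starting point for discussion rather than a proposed standard.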