https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596

--- Comment #13 from Mateusz Guzik <mjguzik at gmail dot com> ---
I see there is a significant disconnect here between what I meant with this
problem report and your perspective, so I'm going to be more explicit.

Of course for best performance on a given uarch you would want to -mtune for
that uarch, but that's not the goal here.

Rather, taking the Linux kernel as an example, assume the code has to be
compiled with a generic x86_64 chip in mind. My claim is that the asm emitted
for small inline memcpy and memset calls then loses on performance.
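
For illustration, here is a minimal sketch of the kind of code in question.
The struct, its 40-byte size and the function names are made up for this
example and are not taken from the kernel; whether a given size actually
triggers the rep-based expansion depends on the compiler version and flags.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical small object: 5 * 8 = 40 bytes. The size is an assumption
 * picked for illustration only. */
struct small_obj {
    uint64_t words[5];
};

/* Fixed-size clear: the length is a compile-time constant, so the compiler
 * is free to expand memset inline instead of calling the library routine. */
void small_clear(struct small_obj *o)
{
    memset(o, 0, sizeof(*o));
}

/* Fixed-size copy: same situation for memcpy. */
void small_copy(struct small_obj *dst, const struct small_obj *src)
{
    memcpy(dst, src, sizeof(*dst));
}
```

Compiling this with something like `gcc -O2 -mtune=generic -S` and reading
the resulting asm shows which expansion strategy was picked for each call.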

The last time I had a serious look at string ops optimization was around 2018
or 2019, and at that time all CPUs (AMD included) were struggling with rep
stosq/movsq for short ops.

It makes sense to request benchmarks from other CPUs today.

To that end, I'm asking what kind of standard is expected here in terms of
tests to run. As for AMD uarchs, I can get my hands on two: EPYC Genoa (4th
gen) and EPYC 7571.
