https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
>Benching based on the Linux kernel and the Sapphire Rapids CPU:
With -mtune=sapphirerapids, GCC produces:
```
_Z4zeroP3foo:
.LFB0:
.cfi_startproc
mov QWORD PTR [rdi], 0
mov QWORD PTR [rdi+8], 0
mov QWORD PTR [rdi+16], 0
mov QWORD PTR [rdi+24], 0
mov QWORD PTR [rdi+32], 0
mov BYTE PTR [rdi+40], 0
ret
```
Which is what you want.
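For reference, a minimal sketch of the kind of source that matches the symbol `_Z4zeroP3foo` (that is, `zero(foo*)`) and the 41 bytes of stores above. The layout of `foo` here is an assumption; the actual testcase from the bug is not reproduced in this comment:
```cpp
#include <cstring>

// Hypothetical 41-byte object: five 8-byte stores plus one 1-byte store
// in the assembly above cover exactly 5*8 + 1 = 41 bytes.
struct foo {
    unsigned char bytes[41];
};

static_assert(sizeof(foo) == 41, "size assumed from the stores in the assembly");

// Clears the whole object; with a fixed small size the compiler is free to
// expand this inline as individual stores instead of calling memset.
void zero(foo *p) {
    std::memset(p, 0, sizeof *p);
}
```
Compiling something like this with `g++ -O2 -masm=intel -mtune=sapphirerapids -S` is one way to compare against the listing above; the exact output may differ between GCC versions.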
Again I will mention this:
Plus, for generic tuning you need to benchmark on more than just one processor
(at least a few Intel and AMD processors).