https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87236
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
IIRC this is due to the generic tuning; if you compile with -mtune=intel, the value won't go via memory. Basically, on some AMD hardware the path via memory is faster than a direct move between the SSE registers and the GPRs.
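
For anyone who wants to check this locally, here is a minimal sketch. It is not the test case from the report; the function names double_to_bits and bits_to_double are made up for illustration. Each function forces a value to cross between an SSE register and a general-purpose register, so the generated assembly should show whether the compiler used a direct move or a store/load pair. Which one you get depends on the GCC version and the tuning in effect.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the argument arrives in an SSE register on x86-64
   and the result must be returned in a GPR, so the compiler has to pick
   either a direct register move or a round trip through the stack. */
uint64_t double_to_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits); /* well-defined way to reinterpret the bytes */
    return bits;
}

/* Hypothetical helper for the opposite direction: GPR to SSE register. */
double bits_to_double(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Assuming the file is saved as xfer.c, comparing the output of "gcc -O2 -S -mtune=generic xfer.c" with "gcc -O2 -S -mtune=intel xfer.c" should show whether the tuning changes the move strategy on your compiler.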