https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820
--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
Also, the other direction is not symmetric: on some CPUs, a store/reload strategy for xmm->int might be better even if an ALU strategy is best for int->xmm.

The choice can also depend on chunk size, since loads are cheap (2 per clock on AMD since K8 and on Intel since SnB) and store-forwarding works. On some CPUs it might also be good to do the first element with movd and the next with store/reload, especially if there's independent work that can start on the movd result.

I also discussed some of this at the bottom of the first post in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833.
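
To make the xmm->int options concrete, here is a rough C sketch of the three strategies for getting both 64-bit halves of a vector into integer registers. It assumes x86-64 with SSE2; the type halves_t and the function names are made up for illustration and are not taken from GCC. The intrinsics only hint at the intended instructions (movq, punpckhqdq, a vector store plus scalar loads); the compiler is free to emit something else, which is the point of this bug.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper type: both 64-bit halves of a vector as scalars. */
typedef struct { uint64_t lo, hi; } halves_t;

/* ALU strategy: movq for the low half, shuffle + movq for the high half. */
static halves_t extract_alu(__m128i v)
{
    halves_t r;
    r.lo = (uint64_t)_mm_cvtsi128_si64(v);
    r.hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v));
    return r;
}

/* Store/reload strategy: one vector store, then two scalar loads
   that are expected to be satisfied by store-forwarding. */
static halves_t extract_store_reload(__m128i v)
{
    uint64_t tmp[2];
    halves_t r;
    _mm_storeu_si128((__m128i *)tmp, v);
    r.lo = tmp[0];
    r.hi = tmp[1];
    return r;
}

/* Mixed strategy: low half through the ALU so dependent work can
   start early, high half through memory. */
static halves_t extract_mixed(__m128i v)
{
    uint64_t tmp[2];
    halves_t r;
    r.lo = (uint64_t)_mm_cvtsi128_si64(v);
    _mm_storeu_si128((__m128i *)tmp, v);
    r.hi = tmp[1];
    return r;
}
```

All three return the same values for the same input; only the expected instruction mix differs, so which one wins would have to be measured per microarchitecture.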
