https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99881
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
But 2 element construction _should_ be cheap.  What is missing is the move cost
from GPR to XMM regs (but we do not have a good idea whether the sources are
memory, so it's not as clear-cut here either).

IMHO a better approach might be to up unaligned vector store/load costs?

For the testcase at hand why does a throughput of 1 pose a problem?  There's
only one punpckldq instruction around?

Note that for the case of non-loop vectorization of 'double' the two element
vector CTORs are common and important to handle cheaply.

See also all the discussion in PR98856
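
For readers unfamiliar with the term, a minimal sketch of a two element vector CTOR on 'double', written with the GCC vector extension. This is not the testcase from the PR; the names make_pair, add_pairs and self_check are hypothetical and only illustrate the kind of straight-line code where such CTORs show up. What instructions the compiler actually emits depends on the target, the options, and where the scalars live.

```c
#include <assert.h>

/* GCC/Clang vector extension: a 16-byte vector holding two doubles. */
typedef double v2df __attribute__((vector_size(16)));

/* Two element vector CTOR built from two scalars.  Whether this is cheap
   depends on where a and b come from (XMM registers, GPRs or memory),
   which is the uncertainty the comment above points out.  */
v2df make_pair(double a, double b)
{
    return (v2df){ a, b };
}

/* Straight-line (non-loop) code: two independent additions expressed as
   one vector add whose operands are both two element CTORs.  */
void add_pairs(double *out, double a, double b, double c, double d)
{
    v2df r = make_pair(a, b) + make_pair(c, d);
    out[0] = r[0];
    out[1] = r[1];
}

/* Small usage check: (1,2) + (3,4) == (4,6).  */
void self_check(void)
{
    double out[2];
    add_pairs(out, 1.0, 2.0, 3.0, 4.0);
    assert(out[0] == 4.0 && out[1] == 6.0);
}
```

Compiling something like this with -O2 and inspecting the assembly is one way to see how the CTOR is materialized on a given target.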