https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99881
--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #3)
> But 2 element construction _should_ be cheap. What is missing is the move
> cost from GPR to XMM regs (but we do not have a good idea whether the sources
> are memory, so it's not as clear-cut here either).
>
> IMHO a better approach might be to up unaligned vector store/load costs?
>
> For the testcase at hand why does a throughput of 1 pose a problem? There's
> only one punpckldq instruction around?
>

There are several lea/add instructions (which may also use port 5) around the punpckldq. Since Fast LEA and Int ALU operations are common in address computation, a throughput of 1 for punpckldq will be a bottleneck. Refer to https://godbolt.org/z/hK9r5vTzd for the original case.

> Note that for the case of non-loop vectorization of 'double' the two element
> vector CTORs are common and important to handle cheaply. See also all the
> discussion in PR98856