https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99881

--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #3)
> But 2 element construction _should_ be cheap.  What is missing is the move
> cost from GPR to XMM regs (but we do not have a good idea whether the sources
> are memory, so it's not as clear-cut here either).
> 
> IMHO a better approach might be to up unaligned vector store/load costs?
> 
> For the testcase at hand why does a throughput of 1 pose a problem?  There's
> only one punpckldq instruction around?
> 

There are several lea/add instructions (which may also use port 5) around the
punpckldq. Since fast LEA and integer ALU operations are common in address
computation, a throughput of 1 for punpckldq will be a bottleneck.

Refer to https://godbolt.org/z/hK9r5vTzd for the original case.
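
For readers without the link handy, here is a minimal sketch of the kind of
pattern under discussion. It is a hypothetical reduction, not the testcase
from this PR: the function name pack_pairs and its parameters are made up for
illustration. Two 32-bit values are loaded through computed addresses and
stored as an adjacent pair, which the vectorizer may turn into a two-element
vector construction (on x86, possibly movd + movd + punpckldq) next to the
lea/add address arithmetic. Whether GCC actually emits that sequence depends
on the version, the flags and the cost model.

```c
/* Hypothetical illustration only; not the testcase from the PR.
   Each iteration loads two ints through an index (address computation
   that typically needs lea/add) and stores them as one adjacent pair,
   a candidate for a two-element vector CTOR. */
void pack_pairs(int *restrict dst, const int *restrict a,
                const int *restrict b, const int *restrict idx, int n)
{
    for (int i = 0; i < n; i++) {
        dst[2 * i]     = a[idx[i]];  /* low element of the pair  */
        dst[2 * i + 1] = b[idx[i]];  /* high element of the pair */
    }
}
```

Comparing the assembly for something like this at -O2 and -O3, with and
without -fno-tree-vectorize, should show whether the pair is built in an XMM
register or stored as two scalars.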

> Note that for the case of non-loop vectorization of 'double' the two element
> vector CTORs are common and important to handle cheaply.  See also all the
> discussion in PR98856
