[Bug target/99881] Regression compare -O2 -ftree-vectorize with -O2 on SKX/CLX

crazylht at gmail dot com via Gcc-bugs Tue, 06 Apr 2021 03:06:25 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99881


--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #3)
> But 2 element construction _should_ be cheap.  What is missing is the move
> cost from GPR to XMM regs (but we do not have a good idea whether the sources
> are memory, so it's not as clear-cut here either).
> 
> IMHO a better approach might be to up unaligned vector store/load costs?
> 
> For the testcase at hand why does a throughput of 1 pose a problem?  There's
> only one punpckldq instruction around?
> 

There are several lea/add instructions (which may also use port 5) around
punpckldq. Since fast LEA and integer ALU operations are common in address
computation, a throughput of 1 for punpckldq will be a bottleneck.

Refer to https://godbolt.org/z/hK9r5vTzd for the original case.
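
A minimal sketch of the pattern under discussion, for illustration only. It is
NOT the testcase from this PR or from the godbolt link; the struct, the
function name and the index expression are all hypothetical. Two int values
that live in GPRs are stored to adjacent memory behind some address
arithmetic. With -O2 -ftree-vectorize, SLP may merge the two stores into one
64-bit vector store, which requires building a two-element vector from the
GPRs, and that is where punpckldq can show up next to the lea/add
instructions.

```c
#include <assert.h>

/* Hypothetical illustration, not the testcase from this PR. */
struct pair { int lo, hi; };

/* x and y arrive in general-purpose registers.  If SLP vectorizes the
   two adjacent stores, the compiler has to construct the vector {x, y}
   first (for example movd + movd + punpckldq) before a single 64-bit
   store.  Whether this happens depends on the compiler version and the
   cost model, so treat it as a possible outcome, not a guarantee.  */
void store_pair(struct pair *out, long i, long j, int x, int y)
{
    /* Address arithmetic like this usually becomes lea/add, which
       may compete with punpckldq for the same execution port.  */
    struct pair *dst = &out[i * 3 + j];

    dst->lo = x;
    dst->hi = y;
}

/* Small self-check so the sketch can be compiled and run as-is.  */
void self_check(void)
{
    struct pair buf[8] = { { 0, 0 } };

    store_pair(buf, 1, 2, 7, 9);   /* writes element 1 * 3 + 2 == 5 */
    assert(buf[5].lo == 7);
    assert(buf[5].hi == 9);
}
```

Comparing the assembly for this function at -O2 and at -O2 -ftree-vectorize
is one way to see whether the vector construction is emitted on a given
compiler build.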

> Note that for the case of non-loop vectorization of 'double' the two element
> vector CTORs are common and important to handle cheaply.  See also all the
> discussion in PR98856
