[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

nathanael.schaeffer at gmail dot com via Gcc-bugs Sun, 25 Feb 2024 16:13:07 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107


--- Comment #3 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
I have not benchmarked it.
For the 4 vmulpd instructions doing the actual work, there are more than 40
permute/mov instructions, including 24 vpermd instructions, each with a
3-cycle latency.
That is 6 vpermd per vmulpd.
There is no way this can be faster than vbroadcastsd. I would bet it is 4 to 10
times slower than the vbroadcastsd loop.
If you want, I can benchmark it tomorrow.

If this is a cost model problem, it is a bad one. Even ignoring the decoding of
all these instructions, how can adding 6 vpermd to each vmulpd be faster?
I would rather think (hope?) the optimizer does not consider the vbroadcastsd
solution at all.
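
For reference, here is a minimal sketch of the kind of loop under discussion.
It is not the test case attached to the bug. It assumes that "different
multiplicity" means one element of b scales four consecutive elements of a.
The function name and the factor of 4 are illustrative assumptions.

```c
#include <stddef.h>

/* Hypothetical loop shape (an assumption, not the reporter's code):
 * each b[i] multiplies the four consecutive doubles a[4*i] .. a[4*i+3]. */
void scale_groups_of_four(double *restrict a, const double *restrict b,
                          size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* One scalar reused for four contiguous lanes: with 256-bit
         * vectors this maps naturally to a single broadcast of b[i]
         * followed by a single packed multiply. */
        const double s = b[i];
        a[4 * i + 0] *= s;
        a[4 * i + 1] *= s;
        a[4 * i + 2] *= s;
        a[4 * i + 3] *= s;
    }
}
```

With this shape, a broadcast-based loop needs roughly one broadcast per
multiply, which is the comparison being made against 6 vpermd per vmulpd.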