https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #3 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
I have not benchmarked it.
For the 4 vmulpd doing the actual work, there are more than 40 permute/mov
instructions, including 24 vpermd instructions, each with a 3-cycle latency.
That is 6 vpermd per vmulpd.
There is no way this can be faster than vbroadcastsd. I would bet it is 4 to 10
times slower than the vbroadcastsd loop.
If you want, I can benchmark it tomorrow.

If this is a cost model problem, it is a bad one. Even ignoring the decoding of
all these instructions, how can adding 6 vpermd to each vmulpd be faster?
I would rather think (hope?) the optimizer does not consider the vbroadcastsd
solution at all.
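
For illustration only, here is a minimal sketch of the kind of loop where a
broadcast is the natural fit: one scalar factor applied to a group of 4
consecutive doubles. The function name, signature and group size of 4 are
assumptions made for this example and are not taken from the test case in the
bug report. With 256-bit AVX, each group could in principle be handled by one
vbroadcastsd (replicating the scalar into 4 lanes) followed by one vmulpd.

```c
#include <stddef.h>

/* Hypothetical example, not the code from the bug report:
 * factor[i] scales the 4 consecutive doubles data[4*i .. 4*i+3]. */
void scale_groups_of_4(double *restrict data,
                       const double *restrict factor,
                       size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* One scalar per group: a candidate for a single vbroadcastsd. */
        const double f = factor[i];

        /* Four multiplies on contiguous data: a candidate for one
         * 256-bit vmulpd on the group. */
        for (size_t k = 0; k < 4; k++)
            data[4 * i + k] *= f;
    }
}
```

Compiling something like this at -O2 and -O3 with AVX2 enabled and comparing
the assembly would be one way to check which strategy the vectorizer picks.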
