https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
--- Comment #3 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
I have not benchmarked it. For the 4 vmulpd instructions doing the actual work, there are more than 40 permute/mov instructions, including 24 vpermd instructions, each with a 3-cycle latency. That is 6 vpermd per vmulpd. There is no way this can be faster than vbroadcastsd. I would bet it is 4 to 10 times slower than the vbroadcastsd loop. If you want, I can benchmark it tomorrow.

If this is a cost model problem, it is a bad one. Even ignoring the cost of decoding all these instructions, how can adding 6 vpermd to each vmulpd be faster? I would rather think (hope?) that the optimizer does not consider the vbroadcastsd solution at all.
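
For illustration only, here is a minimal sketch of the kind of loop where a vbroadcastsd-based solution would apply. It assumes each scalar from one array scales a group of 4 consecutive doubles from another. The function name, the array names, and the exact loop shape are assumptions made for this sketch; the actual test case is the one attached to the bug report and is not reproduced in this comment.

```c
#include <stddef.h>

/* Hypothetical kernel (names and layout are assumptions, not taken from
   the bug report): s[i] scales the 4 doubles v[4*i .. 4*i+3]. */
void scale_groups(const double *restrict s, const double *restrict v,
                  double *restrict out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* A compiler targeting AVX could load s[i] once and replicate it
           into all 4 lanes of a 256-bit register (vbroadcastsd). */
        const double b = s[i];

        /* The 4 multiplies below could then map to a single vmulpd. */
        for (int k = 0; k < 4; k++)
            out[4 * i + k] = b * v[4 * i + k];
    }
}
```

With this shape, one broadcast feeds one multiply per group, which is the ratio I am comparing against the 6 vpermd per vmulpd seen in the generated code.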