https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
--- Comment #10 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
Interestingly (and maybe surprisingly), I can get gcc to produce nearly optimal code using vbroadcastsd with the following options:
-O2 -march=skylake -ftree-vectorize
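For readers who want to try these options, here is a minimal sketch of the kind of loop where a scalar broadcast such as vbroadcastsd would be a natural choice. It is a hypothetical kernel, not the test case from this report: the function name scale_groups, the group size of 4, and the argument layout are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical kernel (not taken from the bug report): each scale factor
 * s[i] applies to a group of 4 consecutive doubles, so a vectorizer
 * targeting 256-bit AVX registers may choose to broadcast s[i] across
 * all 4 lanes before multiplying. */
void scale_groups(double *restrict out, const double *restrict in,
                  const double *restrict s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (int k = 0; k < 4; k++) {
            out[4 * i + k] = s[i] * in[4 * i + k];
        }
    }
}
```

Compiling this with gcc -O2 -march=skylake -ftree-vectorize -S and reading the generated assembly shows whether the compiler broadcasts s[i]. The result is likely to depend on the GCC version and its cost model, so it should be checked rather than assumed.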