https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
the naive "bad" code-gen produces

size   512-masked
   2     12.19
   4      6.09
   6      4.06
   8      3.04
  12      2.03
  14      1.52
  16      1.21
  20      1.01
  24      0.87
  32      0.76
  34      0.71
  38      0.64
  42      0.58

on alberti (you seem to have used the same machine).

So the AVX512 "stupid" code-gen is faster for 6+ elements and I guess
optimizing it should then outperform scalar also for 4 elements.  The
exact matches for 8 on 128 and 16 on 256 are hard to beat of course,
likewise the single or two iteration case.