https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
the naive "bad" code-gen produces

size  512-masked
  2    12.19
  4     6.09
  6     4.06
  8     3.04
 12     2.03
 14     1.52
 16     1.21
 20     1.01
 24     0.87
 32     0.76
 34     0.71
 38     0.64
 42     0.58

on alberti (you seem to have used the same machine).  So the AVX512 "stupid"
code-gen is already faster than scalar for 6+ elements, and I guess optimizing
it should then outperform scalar also for 4 elements.  The exact matches for 8
elements on 128-bit vectors and 16 on 256-bit vectors are hard to beat of
course, likewise the single or two iteration case.
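
For readers unfamiliar with the term, here is a rough scalar model of what a
fully masked 512-bit loop does.  It is only a sketch under assumptions that are
not stated in this comment: 16-bit elements (guessed from "8 on 128 and 16 on
256"), a hypothetical element-wise add as the kernel, and made-up names
(LANES, add_scalar, add_masked_model).  It is not the code GCC emits.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 16-bit elements, so one 512-bit vector holds 32 lanes. */
enum { LANES = 32 };

/* Hypothetical kernel, plain scalar reference: dst[i] = a[i] + b[i]. */
static void add_scalar(uint16_t *dst, const uint16_t *a, const uint16_t *b,
                       size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint16_t)(a[i] + b[i]);
}

/* Scalar model of a fully masked vector loop.  Every iteration covers LANES
   elements; a per-lane mask switches off the lanes past the end, so no
   scalar epilogue is needed.  Returns the number of vector iterations.  */
static size_t add_masked_model(uint16_t *dst, const uint16_t *a,
                               const uint16_t *b, size_t n)
{
    size_t iters = 0;
    for (size_t base = 0; base < n; base += LANES) {
        size_t active = n - base < LANES ? n - base : LANES;
        /* Low 'active' bits set; avoid shifting by the full width. */
        uint32_t mask = active == LANES ? UINT32_MAX
                                        : (UINT32_C(1) << active) - 1;
        for (size_t lane = 0; lane < LANES; lane++)
            if ((mask >> lane) & 1)  /* inactive lanes touch no memory */
                dst[base + lane] = (uint16_t)(a[base + lane] + b[base + lane]);
        iters++;
    }
    return iters;
}
```

Under these assumptions every size up to 32 takes one masked iteration and
sizes 34 to 42 take two, and the masked store never writes past element n.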
