https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63791
H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |INVALID

--- Comment #5 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Marcus Kool from comment #2)
> 
> To resume, gcc 4.8.4 and gcc 4.9.2 produce code that can be optimised
> further, and gcc 5.1.0 produces even slower code which means that the
> implementation of *_set1_epi8() is slower/much-slower than that it can be.

That is done on purpose.  Adding -mtune=intel will get what you want.
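
For readers who want to check the suggestion themselves, here is a minimal sketch of a test case. It is not taken from the bug report: the helper name broadcast_byte and the file name example.c are hypothetical, and it assumes an x86 target with SSE2. Compiling it twice, for instance with "gcc -O2 -S example.c" and "gcc -O2 -mtune=intel -S example.c", and comparing the two assembly listings should show whether the tuning option changes the code generated for the broadcast.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_set1_epi8, _mm_storeu_si128 */
#include <stdint.h>

/* Hypothetical helper (not from the bug report): broadcast one byte
 * into all 16 lanes of an SSE register and store the result to out. */
void broadcast_byte(uint8_t out[16], char c)
{
    __m128i v = _mm_set1_epi8(c);          /* replicate c into every byte lane */
    _mm_storeu_si128((__m128i *)out, v);   /* unaligned 16-byte store */
}
```

The function is kept free of constants so the compiler cannot fold the broadcast away, which keeps the instruction sequence for _mm_set1_epi8() visible in the listing.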