https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67072
--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
I restructured the intrinsics loop to match the asm. This gave a small speedup, but it's still ~30% slower than my asm. clang-3.5 is even slower than gcc.

I still don't software-pipeline the loads in C the way I do in asm; that kind of instruction scheduling is something compilers should do, shouldn't they? Still, that shouldn't make a 30% difference. The re-order buffer should be big enough to hold the uops for a few independent iterations.

https://github.com/pcordes/par2-asm-experiments/blob/d061202b69218963fdc03619be208327e03ceb71/intrin-pinsrw.c

Again, timings are on an Intel Sandybridge i5-2500k, with the system mostly idle.
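
To illustrate what "software-pipelining the loads" means here, below is a minimal sketch in plain C. It is not the code from intrin-pinsrw.c: the kernel (a table lookup XORed into a destination buffer) and the names xor_lookup_simple and xor_lookup_pipelined are hypothetical, chosen only to show the shape of the transformation. The pipelined version issues the load for iteration i+1 before the result of iteration i is consumed, which is the reordering one might hope the compiler's scheduler does on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel (not from the linked source): dst[i] ^= table[src[i]].
 * Straightforward version: each iteration loads its index, then uses it. */
static void xor_lookup_simple(uint16_t *dst, const uint8_t *src,
                              const uint16_t table[256], size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= table[src[i]];
}

/* Software-pipelined version: the index for iteration i+1 is loaded
 * before the lookup for iteration i is consumed. */
static void xor_lookup_pipelined(uint16_t *dst, const uint8_t *src,
                                 const uint16_t table[256], size_t n)
{
    if (n == 0)
        return;

    uint8_t idx = src[0];              /* prologue: first load */
    for (size_t i = 0; i + 1 < n; i++) {
        uint8_t next = src[i + 1];     /* load for the next iteration */
        dst[i] ^= table[idx];          /* work for the current iteration */
        idx = next;
    }
    dst[n - 1] ^= table[idx];          /* epilogue: last element */
}
```

Both functions compute the same result; only the order in which the loads appear in the source differs. Whether that ordering survives optimization, and whether it matters on an out-of-order core, depends on the compiler and the microarchitecture.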