[Bug rtl-optimization/67072] Slow code generated for getting each byte of a 64bit register as a LUT index.

peter at cordes dot ca Thu, 30 Jul 2015 18:07:27 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67072


--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
I restructured the intrinsics loop to match the asm.  This gave a small
speedup, but it's still ~30% slower than my asm.  clang-3.5 is even slower than
gcc.

I still don't software-pipeline the loads in C the way I do in asm.  That kind
of instruction scheduling is something compilers should do, shouldn't they?
Still, that shouldn't make a 30% difference.  The re-order buffer should be big
enough to hold the uops for a few independent iterations.
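
For readers unfamiliar with the term, here is a minimal sketch of what
software-pipelining the loads by hand could look like in plain C.  It is not
the code from the linked intrin-pinsrw.c: the names lut_bytes and process, the
16-bit table entries, and the XOR accumulation are all assumptions made for
illustration.  The only point is that the load for the next iteration is
issued before the byte-extraction work on the current value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: XOR together lut[b] for each of the 8 bytes of x. */
static inline uint16_t lut_bytes(uint64_t x, const uint16_t lut[256])
{
    uint16_t acc = 0;
    for (int i = 0; i < 8; i++) {
        acc ^= lut[x & 0xff];   /* low byte is the table index */
        x >>= 8;                /* bring the next byte down */
    }
    return acc;
}

/* Hand software-pipelined loop (illustrative): the load for iteration i+1
 * is issued before the LUT work for iteration i. */
static void process(uint16_t *dst, const uint64_t *src, size_t n,
                    const uint16_t lut[256])
{
    if (n == 0)
        return;
    uint64_t cur = src[0];              /* prologue: first load */
    for (size_t i = 0; i + 1 < n; i++) {
        uint64_t next = src[i + 1];     /* load for the next iteration */
        dst[i] = lut_bytes(cur, lut);   /* work on the current value */
        cur = next;
    }
    dst[n - 1] = lut_bytes(cur, lut);   /* epilogue: last value */
}
```

A compiler is free to reorder this back into the straightforward form, so
whether the source-level ordering survives into the generated asm has to be
checked in the compiler output.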

https://github.com/pcordes/par2-asm-experiments/blob/d061202b69218963fdc03619be208327e03ceb71/intrin-pinsrw.c


Again, timings are on an Intel Sandybridge i5-2500k, with the system mostly
idle.

