http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47556
--- Comment #2 from Jeremy Fitzhardinge <jeremy at goop dot org> 2011-01-31 22:16:54 UTC ---
Hm, yes, I see. The hand-written asm, which uses %ah, does appear to run into false partial register stalls according to section 3.5.2.3 of the Intel Optimisation Reference Manual. On the other hand, the code generated from the C version measures slightly slower on a Nehalem system. Since the code in question is all in the slow path (it's the spin loop for a spinlock), perhaps the icache pressure from the larger code size is more significant than the register stalls. Compiling with -Os rather than -O2 makes no difference.
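
For readers without the code at hand: the comment does not quote the spin loop being measured, so the following is only a rough sketch of the kind of C slow path under discussion, not the code from this report. It assumes a ticket-style lock packed into one 16-bit word (head in the low byte, tail in the high byte); the names `ticket_lock_t`, `ticket_lock`, `ticket_unlock` and `cpu_relax` are hypothetical. Extracting the high byte is where a compiler or a hand-written version might choose a register such as %ah, though whether it does depends on the compiler and flags.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout (an assumption, not taken from the bug report):
 * low byte  = head, the ticket currently being served
 * high byte = tail, the next ticket to hand out */
typedef struct {
    _Atomic uint16_t slock;
} ticket_lock_t;

/* Spin-wait hint; expands to the x86 PAUSE instruction where available. */
static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#endif
}

static void ticket_lock(ticket_lock_t *lock)
{
    /* Take a ticket: bump the tail (high byte) and fetch the old word. */
    uint16_t old = atomic_fetch_add_explicit(&lock->slock, 0x0100,
                                             memory_order_acquire);
    uint8_t ticket = (uint8_t)(old >> 8);

    /* Slow path: spin until the head byte reaches our ticket. */
    while ((uint8_t)atomic_load_explicit(&lock->slock,
                                         memory_order_acquire) != ticket)
        cpu_relax();
}

static void ticket_unlock(ticket_lock_t *lock)
{
    uint16_t old = atomic_load_explicit(&lock->slock, memory_order_relaxed);
    uint16_t new_val;

    /* Advance the head byte only; a carry must not spill into the tail. */
    do {
        new_val = (uint16_t)((old & 0xff00u) | ((old + 1u) & 0x00ffu));
    } while (!atomic_compare_exchange_weak_explicit(&lock->slock, &old, new_val,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}
```

Comparing the output of `gcc -O2 -S` and `gcc -Os -S` on a file like this is one way to see how the byte extraction and the spin loop are actually compiled on a given GCC version.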