On Wed, Mar 30, 2011 at 8:05 AM, Lassi Tuura <[email protected]> wrote:

> For completeness, perhaps I should mention that I also tested with ".p2align
> 2" and ".p2align 4" right before ".global _Ux86_64_getcontext_trace". The
> results started to be slightly sporadic, but curiously all the aligned
> versions were slightly but systematically slower than the unaligned one (by
> ~1-2%).
>
> The function is definitely unaligned with the patch, at offset 0x4e09 into
> the shared library in my case.
>
Effects like this are usually related to how the x86 decoder works on your
CPU. On the Nehalem/Westmere generation it fetches instructions in 16-byte
bundles and can decode up to three simple instructions and one complex
instruction per cycle. There are a lot of interesting stories about how
inserting or removing a nop from a hot loop changes throughput significantly.

 -Arun

_______________________________________________
Libunwind-devel mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/libunwind-devel
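
To make the 16-byte fetch bundle point concrete, here is a rough sketch of where the function entry at offset 0x4e09 sits inside a 16-byte window, and where ".p2align 2" or ".p2align 4" would move it if applied at that same offset. The helper names are made up for illustration. It assumes the shared library is mapped at a base address that is a multiple of 16 (a page-aligned mapping satisfies this), so the file offset and the runtime address agree modulo 16. It says nothing about which placement is faster; that still has to be measured.

```python
FETCH_BYTES = 16  # fetch bundle size mentioned above for Nehalem/Westmere


def offset_in_fetch_window(addr, window=FETCH_BYTES):
    """Byte position of addr inside its fetch window (hypothetical helper)."""
    return addr % window


def p2align(addr, power):
    """Round addr up to the next multiple of 2**power, as .p2align pads to."""
    size = 1 << power
    return (addr + size - 1) & ~(size - 1)


entry = 0x4E09  # function offset reported in the quoted message

# Unaligned entry: 9 bytes into its 16-byte window.
print(hex(entry), offset_in_fetch_window(entry))

for power in (2, 4):
    aligned = p2align(entry, power)
    padding = aligned - entry
    # power=2 -> 0x4e0c, 3 bytes of padding, 12 bytes into the window
    # power=4 -> 0x4e10, 7 bytes of padding, start of a window
    print(power, hex(aligned), padding, offset_in_fetch_window(aligned))
```

Note that ".p2align 2" only guarantees 4-byte alignment, so the entry can still land late in a fetch window, and either directive also shifts every instruction after it.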
