On Wed, Mar 30, 2011 at 8:05 AM, Lassi Tuura <[email protected]> wrote:

> For completeness, perhaps I should mention that I also tested with ".p2align 
> 2" and ".p2align 4" right before ".global _Ux86_64_getcontext_trace". The 
> results started to be slightly sporadic, but curiously all the aligned 
> versions were slightly but systematically slower than the unaligned one (by 
> ~1-2%).
>
> The function is definitely unaligned with the patch, at offset 0x4e09 into 
> the shared library in my case.
>

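Effects like this are usually related to how the x86 instruction
decoder works on your CPU. On the Nehalem/Westmere generation the front
end fetches aligned 16-byte blocks of code and decodes up to three
simple instructions and one complex instruction per cycle, so where a
function or loop starts within a 16-byte block changes how its
instructions get grouped for decoding. There are a lot of interesting
stories about how inserting or removing a nop from a hot loop changes
throughput significantly.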
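
To make the alignment arithmetic concrete, here is a minimal sketch in C.
The helper names are hypothetical (they are not part of libunwind), and it
assumes the 16-byte fetch width mentioned above and that the low bits of
the library offset carry over to the runtime address, which holds when the
library is mapped on a page boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes of padding that ".p2align n" would insert
   at `offset` to reach the next 2^n-byte boundary. */
static uint64_t p2align_padding(uint64_t offset, unsigned n)
{
    assert(n < 64);
    uint64_t mask = (UINT64_C(1) << n) - 1;
    /* Unsigned negation gives the distance up to the next boundary. */
    return (-offset) & mask;
}

/* Hypothetical helper: position of `offset` inside its 16-byte block
   (16 bytes is the assumed fetch width). */
static unsigned fetch_block_offset(uint64_t offset)
{
    return (unsigned)(offset & 15);
}
```

For the offset 0x4e09 quoted above, fetch_block_offset() gives 9, so the
function starts 9 bytes into a 16-byte block. ".p2align 2" would add 3
bytes of padding (moving it to 0x4e0c) and ".p2align 4" would add 7
(moving it to 0x4e10). This only describes where the code lands; it does
not predict which placement decodes faster.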

 -Arun

_______________________________________________
Libunwind-devel mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/libunwind-devel
