https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87608
--- Comment #2 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #1)
> Note the compiler can evaluate the initialization loop and then also
> evaluate the effect of static_sort1 call, so the testcase might give
> misleading results. To avoid that, pass the address of 'a' to rdtsc, or
> introduce a compiler barrier with an asm:
>
>     asm volatile ("" :: "r"(a) : "memory");
>
> Furthermore, note that the CPU executes the rdtsc instruction without
> waiting for all preceding computations to complete. Using lfence just before
> rdtsc will ensure that rdtsc reads the cycle counter only after all
> preceding computations are done.

Thanks for the hint. I added the memory barrier to the code, but it didn't
make any appreciable difference to the timing.