Hi Paul,

Now you got me hooked :)
> the following code should be faster than the other options mentioned,
> as it should avoid both conditional branches and compilers'
> overoptimizations.
>
>      return (t + (t < 0)) / 1000000000 - (t < 0);

The assembly code indeed shows no conditional jump and only one 64-bit
multiplication instruction. (The multiplier 1237940039285380275 is
⌈2^90 / 10^9⌉; together with the shift by 26 bits after taking the high
64-bit half of the product, it implements the division by 10^9.)

=================================================================
long long sec1 (long long t)
{
  return (t < 0 ? (t + 1) / 1000000000 - 1 : t / 1000000000);
}

long long sec2 (long long t)
{
  return t / 1000000000 - (t % 1000000000 < 0);
}

long long sec3 (long long t)
{
  return (t + (t < 0)) / 1000000000 - (t < 0);
}
=================================================================

With gcc:

sec3:
        movabsq $1237940039285380275, %rdx
        movq    %rdi, %rcx
        shrq    $63, %rcx
        addq    %rcx, %rdi
        movq    %rdi, %rax
        sarq    $63, %rdi
        imulq   %rdx
        movq    %rdx, %rax
        sarq    $26, %rax
        subq    %rdi, %rax
        subq    %rcx, %rax
        ret

With clang:

sec3:
        movq    %rdi, %rcx
        shrq    $63, %rcx
        leaq    (%rdi,%rcx), %rax
        movabsq $1237940039285380275, %rdx
        imulq   %rdx
        movq    %rdx, %rax
        shrq    $63, %rax
        sarq    $26, %rdx
        addq    %rdx, %rax
        subq    %rcx, %rax
        retq

And the benchmark:

=================================================================
#include <stdlib.h>

static inline long long sec1 (long long t)
{
  return (t < 0 ? (t + 1) / 1000000000 - 1 : t / 1000000000);
}

static inline long long sec2 (long long t)
{
  return t / 1000000000 - (t % 1000000000 < 0);
}

static inline long long sec3 (long long t)
{
  return (t + (t < 0)) / 1000000000 - (t < 0);
}

volatile long long t = 1576800000000000000LL;
volatile long long x;

int main (int argc, char *argv[])
{
  int repeat = atoi (argv[1]);
  int i;

  for (i = repeat; i > 0; i--)
    x = sec1 (t); // or sec2 (t) or sec3 (t)
}
=================================================================

On an Intel Core m3 CPU:

          gcc        clang
  sec1    1.28 ns    1.03 ns
  sec2    1.84 ns    1.79 ns
  sec3    1.71 ns    1.62 ns

And on sparc64:

          gcc
  sec1    7.79 ns
  sec2    8.07 ns
  sec3    7.93 ns

And on aarch64:

          gcc
  sec1    27.5 ns
  sec2    55.0 ns
  sec3    27.5 ns

Interpreting the results:

- aarch64 has a slow 'sdiv' instruction, so slow that conditional
  branches don't matter.
- On sparc64, the sdivx instruction takes much more time than the
  mulx instruction.
- My micro-benchmark, with 10⁹ repetitions in a row, exploits branch
  prediction to a degree that is unrealistically high: in practice,
  the branch prediction cache would be filled with other stuff.
  This means the realistic time for sec1 should be higher than what
  I measured; how much higher, I can't really say.
- Since branch prediction does not matter for sec2 and sec3, we can
  see that sec3 is always faster than sec2.

=> I'm in favour of installing your new formula.

Bruno
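
P.S. As a sanity check that all three variants really compute the same
floor division ⌊t / 10⁹⌋ — in particular at the boundaries where
truncating division and floor division disagree — a minimal test program
along the following lines can be used (the sample values are my own
ad-hoc choice):

=================================================================
#include <stdio.h>

/* The same three variants as above.  */
static long long sec1 (long long t)
{
  return (t < 0 ? (t + 1) / 1000000000 - 1 : t / 1000000000);
}

static long long sec2 (long long t)
{
  return t / 1000000000 - (t % 1000000000 < 0);
}

static long long sec3 (long long t)
{
  return (t + (t < 0)) / 1000000000 - (t < 0);
}

int main (void)
{
  /* Values around 0 and around multiples of 10^9, where truncating
     division and floor division give different results.  */
  static const long long samples[] =
    {
      -2000000001LL, -2000000000LL, -1999999999LL,
      -1000000001LL, -1000000000LL, -999999999LL,
      -1LL, 0LL, 1LL,
      999999999LL, 1000000000LL, 1000000001LL,
      1576800000000000000LL, -1576800000000000000LL
    };
  int n = sizeof samples / sizeof samples[0];
  int i;

  for (i = 0; i < n; i++)
    {
      long long t = samples[i];
      if (sec1 (t) != sec2 (t) || sec2 (t) != sec3 (t))
        {
          printf ("mismatch at t = %lld: %lld %lld %lld\n",
                  t, sec1 (t), sec2 (t), sec3 (t));
          return 1;
        }
    }
  printf ("all three variants agree\n");
  return 0;
}
=================================================================

The same idea could also address the branch-prediction caveat above:
e.g. flipping the sign of t between benchmark iterations should force
mispredictions in sec1, but I haven't measured that variant.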