Hi Paul,

Now you got me hooked :)
> the following code should be faster than the other options mentioned,
> as it should avoid both conditional branches and compilers'
> overoptimizations.
>
>      return (t + (t < 0)) / 1000000000 - (t < 0);

The assembly code indeed shows no conditional jump and only one 64-bit
multiplication instruction. (The multiplier 1237940039285380275 is
⌈2^90 / 10^9⌉; together with the shift by 26 bits after taking the high
64-bit half of the product, it implements the division by 10^9.)

=================================================================
long long sec1 (long long t)
{
  return (t < 0 ? (t + 1) / 1000000000 - 1 : t / 1000000000);
}

long long sec2 (long long t)
{
  return t / 1000000000 - (t % 1000000000 < 0);
}

long long sec3 (long long t)
{
  return (t + (t < 0)) / 1000000000 - (t < 0);
}
=================================================================

With gcc:

sec3:
        movabsq $1237940039285380275, %rdx
        movq    %rdi, %rcx
        shrq    $63, %rcx
        addq    %rcx, %rdi
        movq    %rdi, %rax
        sarq    $63, %rdi
        imulq   %rdx
        movq    %rdx, %rax
        sarq    $26, %rax
        subq    %rdi, %rax
        subq    %rcx, %rax
        ret

With clang:

sec3:
        movq    %rdi, %rcx
        shrq    $63, %rcx
        leaq    (%rdi,%rcx), %rax
        movabsq $1237940039285380275, %rdx
        imulq   %rdx
        movq    %rdx, %rax
        shrq    $63, %rax
        sarq    $26, %rdx
        addq    %rdx, %rax
        subq    %rcx, %rax
        retq

And the benchmark:

=================================================================
#include <stdlib.h>

static inline long long sec1 (long long t)
{
  return (t < 0 ? (t + 1) / 1000000000 - 1 : t / 1000000000);
}

static inline long long sec2 (long long t)
{
  return t / 1000000000 - (t % 1000000000 < 0);
}

static inline long long sec3 (long long t)
{
  return (t + (t < 0)) / 1000000000 - (t < 0);
}

volatile long long t = 1576800000000000000LL;
volatile long long x;

int main (int argc, char *argv[])
{
  int repeat = atoi (argv[1]);
  int i;

  for (i = repeat; i > 0; i--)
    x = sec1 (t); // or sec2 (t) or sec3 (t)
}
=================================================================

On an Intel Core m3 CPU:

          gcc        clang
  sec1    1.28 ns    1.03 ns
  sec2    1.84 ns    1.79 ns
  sec3    1.71 ns    1.62 ns

And on sparc64:

          gcc
  sec1    7.79 ns
  sec2    8.07 ns
  sec3    7.93 ns

And on aarch64:

          gcc
  sec1    27.5 ns
  sec2    55.0 ns
  sec3    27.5 ns

Interpreting the results:

- aarch64 has a slow 'sdiv' instruction, so slow that conditional
  branches don't matter.
- On sparc64, the sdivx instruction takes much more time than the
  mulx instruction.
- My micro-benchmark, with 10⁹ repetitions in a row, exploits branch
  prediction to a degree that is unrealistically high: in practice,
  the branch prediction cache would be filled with other stuff.
  This means the realistic time for sec1 should be higher than what
  I measured; how much higher, I can't really say.
- Since branch prediction does not matter for sec2 and sec3, we can
  see that sec3 is always faster than sec2.

=> I'm in favour of installing your new formula.

Bruno
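
P.S. As a sanity check that all three variants really compute the same
floor division ⌊t / 10⁹⌋ — in particular at the boundaries where
truncating division and floor division disagree — a minimal test program
along the following lines can be used (the sample values are my own
ad-hoc choice):

=================================================================
#include <stdio.h>

/* The same three variants as above.  */
static long long sec1 (long long t)
{
  return (t < 0 ? (t + 1) / 1000000000 - 1 : t / 1000000000);
}

static long long sec2 (long long t)
{
  return t / 1000000000 - (t % 1000000000 < 0);
}

static long long sec3 (long long t)
{
  return (t + (t < 0)) / 1000000000 - (t < 0);
}

int main (void)
{
  /* Values around 0 and around multiples of 10^9, where truncating
     division and floor division give different results.  */
  static const long long samples[] =
    {
      -2000000001LL, -2000000000LL, -1999999999LL,
      -1000000001LL, -1000000000LL, -999999999LL,
      -1LL, 0LL, 1LL,
      999999999LL, 1000000000LL, 1000000001LL,
      1576800000000000000LL, -1576800000000000000LL
    };
  int n = sizeof samples / sizeof samples[0];
  int i;

  for (i = 0; i < n; i++)
    {
      long long t = samples[i];
      if (sec1 (t) != sec2 (t) || sec2 (t) != sec3 (t))
        {
          printf ("mismatch at t = %lld: %lld %lld %lld\n",
                  t, sec1 (t), sec2 (t), sec3 (t));
          return 1;
        }
    }
  printf ("all three variants agree\n");
  return 0;
}
=================================================================

The same idea could also address the branch-prediction caveat above:
e.g. flipping the sign of t between benchmark iterations should force
mispredictions in sec1, but I haven't measured that variant.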