Hi Paul,

> >  xtime_sec (xtime_t t)
> >  {
> >    return (t < 0
> > -          ? (t + XTIME_PRECISION - 1) / XTIME_PRECISION - 1
> > +          ? (t + 1) / XTIME_PRECISION - 1
> >            : xtime_nonnegative_sec (t));
>
> Thanks for pointing out the bug. We can simplify the fix further (and speed it
> up a bit on typical hosts).
While I like the code you installed (it is simpler than the one I proposed), I
must point out that it's hard to predict what speed characteristics "typical
hosts" will show.

When I compile this file with gcc-9.2.0 -O2 -S (or similarly with clang)

================================================================
long long sec1 (long long t)
{
  return (t < 0 ? (t + 1) / 1000000000 - 1 : t / 1000000000);
}

long long sec2 (long long t)
{
  return t / 1000000000 - (t % 1000000000 < 0);
}
================================================================

I get this assembly code:

sec1:
        testq   %rdi, %rdi
        js      .L5
        movabsq $1237940039285380275, %rdx
        movq    %rdi, %rax
        sarq    $63, %rdi
        imulq   %rdx
        movq    %rdx, %rax
        sarq    $26, %rax
        subq    %rdi, %rax
        ret
.L5:
        movabsq $1237940039285380275, %rdx
        addq    $1, %rdi
        movq    %rdi, %rax
        sarq    $63, %rdi
        imulq   %rdx
        sarq    $26, %rdx
        subq    %rdi, %rdx
        leaq    -1(%rdx), %rax
        ret

sec2:
        movabsq $1237940039285380275, %rdx
        movq    %rdi, %rax
        imulq   %rdx
        movq    %rdx, %rax
        movq    %rdi, %rdx
        sarq    $63, %rdx
        sarq    $26, %rax
        subq    %rdx, %rax
        imulq   $1000000000, %rax, %rdx
        subq    %rdx, %rdi
        shrq    $63, %rdi
        subq    %rdi, %rax
        ret

Similarly with clang 9:

sec1:
        movq    %rdi, %rax
        testq   %rdi, %rdi
        js      .LBB0_1
        shrq    $9, %rax
        movabsq $19342813113834067, %rcx
        mulq    %rcx
        movq    %rdx, %rax
        shrq    $11, %rax
        retq
.LBB0_1:
        addq    $1, %rax
        movabsq $1237940039285380275, %rcx
        imulq   %rcx
        movq    %rdx, %rax
        shrq    $63, %rax
        sarq    $26, %rdx
        addq    %rdx, %rax
        addq    $-1, %rax
        retq

sec2:
        movabsq $1237940039285380275, %rcx
        movq    %rdi, %rax
        imulq   %rcx
        movq    %rdx, %rax
        shrq    $63, %rax
        sarq    $26, %rdx
        addq    %rax, %rdx
        imulq   $1000000000, %rdx, %rax
        subq    %rax, %rdi
        sarq    $63, %rdi
        leaq    (%rdi,%rdx), %rax
        retq

So, sec1 has one more conditional jump, whereas sec2 has one more 64-bit
multiplication instruction in its path. How well will the branch prediction
unit be able to optimize the conditional jump? To find out, I timed both with
this micro-benchmark:

================================================================
#include <stdlib.h>

static inline long long sec1 (long long t)
{
  return (t < 0 ? (t + 1) / 1000000000 - 1 : t / 1000000000);
}

static inline long long sec2 (long long t)
{
  return t / 1000000000 - (t % 1000000000 < 0);
}

volatile long long t = 1576800000000000000LL;
volatile long long x;

int main (int argc, char *argv[])
{
  int repeat = atoi (argv[1]);
  int i;

  for (i = repeat; i > 0; i--)
    x = sec1 (t); // or sec2 (t)
}
================================================================

Results (each variant compiled with -O2, run with argument 1000000000, on an
Intel Core m3 CPU):

          gcc        clang
sec1      1.28 ns    1.04 ns
sec2      1.78 ns    1.78 ns

And on sparc64:

          gcc
sec1      7.79 ns
sec2      8.06 ns

And on aarch64:

          gcc
sec1      27.5 ns
sec2      55.0 ns

Hmm...

Again: I'm not asking to optimize this particular function. It's just that,
from time to time, I like to question the assumptions we make about the
compiler and about "typical hosts".

Bruno
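PS: In case anyone wants to double-check that the two variants really agree
(both are meant to compute the floor of t / 1000000000, also for negative t),
here is a quick standalone check. It is only a sketch for this mail, not part
of any proposed patch, and the helper name sec_ref is made up for the
comparison:

================================================================
#include <assert.h>
#include <stdio.h>

static long long sec1 (long long t)
{
  return (t < 0 ? (t + 1) / 1000000000 - 1 : t / 1000000000);
}

static long long sec2 (long long t)
{
  return t / 1000000000 - (t % 1000000000 < 0);
}

/* Reference: floor division, written out the obvious way.  */
static long long sec_ref (long long t)
{
  long long q = t / 1000000000;   /* C truncates toward zero */
  long long r = t % 1000000000;
  return r < 0 ? q - 1 : q;
}

int main (void)
{
  long long k, d;

  /* Check values around each multiple of 10^9, where the rounding
     direction matters.  */
  for (k = -5; k <= 5; k++)
    for (d = -3; d <= 3; d++)
      {
        long long t = k * 1000000000LL + d;
        assert (sec1 (t) == sec_ref (t));
        assert (sec2 (t) == sec_ref (t));
      }
  printf ("all equal\n");
  return 0;
}
================================================================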