Thanks Andrew for your prompt reply. The results below are for my PoC, which is as close to the proposed patch as I could make it. I had to use a PoC because I can't get chrono with my patch onto godbolt for a comparison between current chrono and patched chrono.
I tried every platform on which I could get it to compile. Please double check everything because I might be misreading some results, especially on platforms that I'm not familiar with. Sometimes godbolt seems to have issues and cuts pieces of the generated assembly from the output. I've marked these cases with (X).

> I suspect we want to disable this for -Os

Below are the sizes with -Os. Most of the time the new code is shorter than the old one, with a few exceptions where they are the same size (because the platform doesn't seem to support [[assume]]). The new code is never longer.

On each link, the middle panel shows the result for the old code and the right panel for the new code. These panels have tabs for different platforms.

                Old    New

https://godbolt.org/z/hfz9szEWf
x86-64          0x81   0x69
ARM32           0x78   0x68
ARM64           0x81   0x71
ARM64 Morello   0x48   0x48
HPPA            0xf8   0xc8
KVX ACB         0xec   0xcc
loongarch64     0x94   0x8c   (X)

https://godbolt.org/z/eMfzoPhT5
M68K            0xb6   0xa6
MinGW           0xa0   0x80   (X)
mips            0xdc   0xac
mips64          0xcc   0xb8
mips64 (el)     0xbc   0xa8
mipsel          0xe0   0xb0
MRISC32         0xa4   0x74
power           0xb8   0x80

https://godbolt.org/z/PjqbTqK6b
power64         0xa8   0x8c
power64le       0xa4   0x88
RISC-V (32)     0x90   0x7e   (X)
RISC-V (64)     0x86   0x86   (X)
s390x           0xf0   0x90
sh              0xc2   0xb2
SPARC           0xc0   0x98
SPARC LEON      0xbc   0x94

https://godbolt.org/z/7oebGMYTM
SPARC64         0xac   0x94
TI C6x          0xc4   0x98
Tricore         0xb0   0xb0
VAX             0xc8   0xc5

> And plus i am not 100% convinced it is best for all micro-architectures.
> Especially on say aarch64.
> Can you do more benchmarking and provide which exact core is being used?

I don't have access to any platform other than x86-64 to do benchmarks :-(

> And mention the size difference too?
Same exercise as explained above, but with -O2:

                  Old     New

https://godbolt.org/z/eqGo9xnz3
x86-64           0xa4    0x72
ARM32            0xa8    0x74
ARM64            0x98    0x80
ARM64 Morello   0x14c   0x14c
HPPA            0x134    0xc8
KVX ACB          0xf4    0x98
loongarch64      0xac    0x9c   (X)

https://godbolt.org/z/7qh94zGMK
M68K            0x13a    0xa2
MinGW            0xd0    0x80   (X)
mips            0x11c    0xe4
mips64          0x130    0xf0
mips64 (el)     0x120    0xe0
mipsel          0x120    0xe8
MRISC32          0xa0    0x74
power            0xdc    0x88

https://godbolt.org/z/Y11Trnqc1
power64          0xd0    0x94
power64le        0xd0    0x90
RISC-V (32)      0xbc    0x84   (X)
RISC-V (64)      0xbe    0x94   (X)
s390x            0xf0    0xa8
sh               0xc6    0xcc   (*)
SPARC           0x108    0x9c
SPARC LEON       0xf4    0x94

https://godbolt.org/z/h456PTEWh
SPARC64          0xc0    0xa0
TI C6x          0x108    0xb0
Tricore          0xe8    0xea   (*)
VAX              0xdc    0xdc   (*)

(*) These are the only cases where the new code is larger than the old one.

> Plus gcc knows how to do %7 via multiplication is that being used or is it
> due to generic x86 tuning it is using the div instruction?

Yes and no. On x86-64 (and probably many other platforms) the current optimisation for n % 7 is a byproduct of the optimisation for /. That is, to calculate n % 7, the generated code evaluates n - (n / 7) * 7. The quotient q = n / 7 is optimised to avoid div and uses a multiplication and other cheaper operations. In total it evaluates 2 multiplications plus shifts, adds, subs and movs. (One multiplication is q * 7, which is performed with LEA + sub.) The algorithm that I'm suggesting performs only one multiplication and one.

Below are the comparisons of n % 7 and the proposed algorithm.

https://godbolt.org/z/o7dazs4Gc
https://godbolt.org/z/zP79736WK
https://godbolt.org/z/65x7naMfq
https://godbolt.org/z/z9ofaMzex

I hope this helps.

Cassio.
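
P.S. To make the n - (n / 7) * 7 point concrete, here is a minimal sketch in source form. It is not the code from the patch. The names mod7_via_quotient, mod7_mul_shift and kMulShiftLimit are made up for illustration, and the second function is only one well-known way to get a remainder from a single wrapping multiplication followed by a shift. It is valid only for a restricted range of n, which is the kind of precondition that something like [[assume]] could communicate to the compiler.

```cpp
#include <cassert>
#include <cstdint>

// What "n % 7" typically compiles to: the remainder is derived from the
// quotient, so the code computes q = n / 7 (a multiplication plus shifts
// once div is optimised away) and then n - q * 7.
constexpr std::uint32_t mod7_via_quotient(std::uint32_t n) {
    std::uint32_t q = n / 7;
    return n - q * 7;
}

// Illustrative upper bound (an assumption of this sketch, chosen
// conservatively): the multiply-and-shift trick below is exact for
// every n below this limit.
constexpr std::uint32_t kMulShiftLimit = UINT32_C(1) << 27;

// Hypothetical single-multiplication variant, NOT taken from the patch.
// 613566757 == ceil(2^32 / 7). The wrapping 32-bit product keeps
// (approximately) the fractional part of n / 7 scaled by 2^32, and its
// top three bits are floor(8 * frac), which equals n % 7 because
// floor(8 * r / 7) == r for r in 0..6.
constexpr std::uint32_t mod7_mul_shift(std::uint32_t n) {
    return (n * UINT32_C(613566757)) >> 29;
}

static_assert(mod7_via_quotient(13) == 6);
static_assert(mod7_mul_shift(13) == 6);
static_assert(mod7_mul_shift(7) == 0);
```

Outside the stated range the approximation error in the constant eventually spills into the top three bits, so the bound is part of the contract and not an optional detail.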