Thanks Andrew for your prompt reply.
The results below are for my PoC, which is as close to the proposed patch as
I could make it. This is because I can't get chrono with my patch onto godbolt
for a comparison between current chrono and patched chrono.
I tried every platform on which I could get it to compile. Please double
check everything because I might be misreading some results, especially on
the platforms that I'm not familiar with. Sometimes godbolt seems to have
issues and cuts pieces of the generated assembly from the output. I've
marked these cases with (X).
> I suspect we want to disable this for -Os
>
Below are the sizes with -Os. Most of the time the new code is shorter than
the old one with a few exceptions where they are the same size (because the
platform doesn't seem to support [[assume]]). The new code is never longer.
On each link, the middle panel shows the result for the old code and the
right panel for the new code. These panels have tabs for different
platforms.
                Old     New
https://godbolt.org/z/hfz9szEWf
x86-64          0x81    0x69
ARM32           0x78    0x68
ARM64           0x81    0x71
ARM64 Morello   0x48    0x48
HPPA            0xf8    0xc8
KVX ACB         0xec    0xcc
loongarch64     0x94    0x8c    (X)
https://godbolt.org/z/eMfzoPhT5
M68K            0xb6    0xa6
MinGW           0xa0    0x80    (X)
mips            0xdc    0xac
mips64          0xcc    0xb8
mips64 (el)     0xbc    0xa8
mipsel          0xe0    0xb0
MRISC32         0xa4    0x74
power           0xb8    0x80
https://godbolt.org/z/PjqbTqK6b
power64         0xa8    0x8c
power64le       0xa4    0x88
RISC-V (32)     0x90    0x7e    (X)
RISC-V (64)     0x86    0x86    (X)
s390x           0xf0    0x90
sh              0xc2    0xb2
SPARC           0xc0    0x98
SPARC LEON      0xbc    0x94
https://godbolt.org/z/7oebGMYTM
SPARC64         0xac    0x94
TI C6x          0xc4    0x98
Tricore         0xb0    0xb0
VAX             0xc8    0xc5
> And plus i am not 100% convinced it is best for all micro-architectures.
> Especially on say aarch64.
> Can you do more benchmarking and provide which exact core is being used?
>
I don't have access to any platform other than x86-64 to do benchmarks :-(
> And mention the size difference too?
>
Same exercise explained above but with -O2:
                Old     New
https://godbolt.org/z/eqGo9xnz3
x86-64          0xa4    0x72
ARM32           0xa8    0x74
ARM64           0x98    0x80
ARM64 Morello   0x14c   0x14c
HPPA            0x134   0xc8
KVX ACB         0xf4    0x98
loongarch64     0xac    0x9c    (X)
https://godbolt.org/z/7qh94zGMK
M68K            0x13a   0xa2
MinGW           0xd0    0x80    (X)
mips            0x11c   0xe4
mips64          0x130   0xf0
mips64 (el)     0x120   0xe0
mipsel          0x120   0xe8
MRISC32         0xa0    0x74
power           0xdc    0x88
https://godbolt.org/z/Y11Trnqc1
power64         0xd0    0x94
power64le       0xd0    0x90
RISC-V (32)     0xbc    0x84    (X)
RISC-V (64)     0xbe    0x94    (X)
s390x           0xf0    0xa8
sh              0xc6    0xcc    (*)
SPARC           0x108   0x9c
SPARC LEON      0xf4    0x94
https://godbolt.org/z/h456PTEWh
SPARC64         0xc0    0xa0
TI C6x          0x108   0xb0
Tricore         0xe8    0xea    (*)
VAX             0xdc    0xdc
(*) These are the only cases where the new code is larger than the old one.
> Plus gcc knows how to do %7 via multiplication is that being used or is it
> due to generic x86 tuning it is using the div instruction?
>
Yes and no. On x86-64 (and probably many other platforms) the current
optimisation for n % 7 is a byproduct of the optimisation for /, that is,
to calculate n % 7, the generated code evaluates n - (n / 7) * 7. The
quotient q = n / 7 is optimised to avoid div and uses a multiplication and
other cheaper operations. In total it evaluates 2 multiplications + shifts
+ adds + subs and movs. (One multiplication is q * 7, which is performed with
LEA + sub.) The algorithm that I'm suggesting, performs only one
multiplication and one. Below are the comparisons of n % 7 and the proposed
algorithm.
https://godbolt.org/z/o7dazs4Gc
https://godbolt.org/z/zP79736WK
https://godbolt.org/z/65x7naMfq
https://godbolt.org/z/z9ofaMzex
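
For readers who prefer source to assembly, here is a minimal sketch of the
n - (n / 7) * 7 shape described above, written out by hand for 32-bit
unsigned n. It only illustrates what the current n % 7 expands to; it is
not the proposed algorithm. The names div7 and mod7 are made up for this
example, and the constant is the textbook multiply-high magic number for
dividing by 7, not something copied from the godbolt output, so the exact
instruction sequence a given compiler emits may differ.

```cpp
#include <cstdint>

// Hypothetical helper: n / 7 without a div instruction.
// 2^32 + 0x24924925 == ceil(2^35 / 7), so n / 7 == (n * (2^32 + m)) >> 35.
// That product does not fit in 64 bits, hence the usual add-and-halve fixup.
constexpr std::uint32_t div7(std::uint32_t n) {
  const std::uint32_t hi =
      static_cast<std::uint32_t>((std::uint64_t{n} * 0x24924925u) >> 32);
  const std::uint32_t t = ((n - hi) >> 1) + hi;  // (n + hi) / 2, no overflow
  return t >> 2;                                 // total shift: 32 + 1 + 2
}

// Hypothetical helper: n % 7 as a byproduct of the division.
constexpr std::uint32_t mod7(std::uint32_t n) {
  const std::uint32_t q = div7(n);
  return n - ((q << 3) - q);  // q * 7 done as shift (or LEA) + sub
}

static_assert(mod7(0) == 0, "0 % 7");
static_assert(mod7(13) == 6, "13 % 7");
static_assert(mod7(0xFFFFFFFFu) == 0xFFFFFFFFu % 7, "largest input");
```

Counting the operations in mod7 gives the multiplication for the quotient,
the q * 7 step, and the shifts, add and subs around them, which is the
cost that the paragraph above refers to.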
I hope this helps.
Cassio.