Hi all, After reflecting on my previous message and Andrew's, I now believe this patch is not the best solution to optimise the day of the week. Instead, the optimisation for n % 7 should be done by the compiler depending on the platform.
I'll open a missing optimisation opportunity bug report against GCC with my suggestions. Therefore, expecting a better solution from the compiler, I'd ask you to disregard this patch. Best wishes, Cassio. On Sun, 11 May 2025 at 01:59, Cassio Neri <cassio.n...@gmail.com> wrote: > Thanks Andrew for your prompt reply. > > The results below regard my PoC which is as close to the proposed patch as > I could make. This is because I can't have chrono with my patch on godbold > for a comparison between current chrono and patched chrono. > > I tried on all platforms that I could make it to compile. Please double > check everything because I might be misreading some results, especially on > the platforms that I'm not familiar with. Sometimes godbold seems to have > issues and cut pieces of the generated assembly from the output. I've > marked these cases with (X). > > I suspect we want to disable this for -Os >> > > Below are the sizes with -Os. Most of the time the new code is shorter > than the old one with a few exceptions where they are the same size > (because the platform doesn't seem to support [[assume]]). The new code is > never longer. On each link, the middle panel shows the result for the old > code and the right panel for the new code. These panels have tabs for > different platforms. > > Old New > > https://godbolt.org/z/hfz9szEWf > > x86-64 0x81 0x69 > ARM32 0x78 0x68 > ARM64 0x81 0x71 > ARM64 Morello 0x48 0x48 > HPPA 0xf8 0xc8 > KVX ACB 0xec 0xcc > loongarch64 0x94 0x8c (x) > > https://godbolt.org/z/eMfzoPhT5 > > M68K 0xb6 0xa6 > MinGW 0xa0 0x80 (X) > mips 0xdc 0xac > mips64 0xcc 0xb8 > mipls64 (el) 0xbc 0xa8 > mipsel 0xe0 0xb0 > MRISC32 0xa4 0x74 > power 0xb8 0x80 > > https://godbolt.org/z/PjqbTqK6b > > power64 0xa8 0x8c > power64le 0xa4 0x88 > RISC-V (32) 0x90 0x7e (X) > RISC-V (64) 0x86 0x86 (X) > s390x 0xf0 0x90 > sh 0xc2 0xb2 > SPARC 0xc0 0x98 > SPARC LEON 0xbc 0x94 > > https://godbolt.org/z/7oebGMYTM > > SPARC64 0xac 0x94 > TI C6x 0xc4 0x98 > Tricore 0xb0 0xb0 > VAX 0xc8 0xc5 > > >> And plus i am not 100% convinced it is best for all micro-architures. >> Especially on say aarch64. >> Can you do more benchmarking and peocide which exaxt core is being used? >> > > I don't have access to any platform other than x86-64 to do benchmarks :-( > > And mention the size difference too? >> > > Same exercise explained above but with -O2: > > Old New > https://godbolt.org/z/eqGo9xnz3 > > x86-64 0x a4 0x 72 > ARM32 0x a8 0x 74 > ARM64 0x 98 0x 80 > ARM64 Morello 0x14c 0x14c > HPPA 0x134 0x c8 > KVX ACB 0x f4 0x 98 > loongarch64 0x ac 0x 9c (X) > > https://godbolt.org/z/7qh94zGMK > > M68K 0x13a 0x a2 > MinGW 0x d0 0x 80 (X) > mips 0x11c 0x e4 > mips64 0x130 0x f0 > mipls64 (el) 0x120 0x e0 > mipsel 0x120 0x e8 > MRISC32 0x a0 0x 74 > power 0x dc 0x 88 > > https://godbolt.org/z/Y11Trnqc1 > > power64 0x d0 0x 94 > power64le 0x d0 0x 90 > RISC-V (32) 0x bc 0x 84 (X) > RISC-V (64) 0x be 0x 94 (X) > s390x 0x f0 0x a8 > sh 0x c6 0x cc (*) > SPARC 0x108 0x 9c > SPARC LEON 0x f4 0x 94 > > https://godbolt.org/z/h456PTEWh > > SPARC64 0x c0 0x a0 > TI C6x 0x108 0x b0 > Tricore 0x e8 0x ea (*) > VAX 0x dc 0x dc > > (*) These are the only cases where the new code is larger than the old one. > > Plus gcc knows how to do %7 via multiplication is that being used or is it >> due to generic x86 tuning it is using the div instruction? >> > > Yes and no. In x86-64 (and probably many other platforms) the current > optimisation for n % 7 is a byproduct of the optimisation for /, that is, > to calculate n % 7, the generated code evaluates n - (n / 7) * 7. The > quotient q = n / 7 is optimised to avoid div and uses a multiplication and > other cheaper operations. In total it evaluates 2 multiplications + > shifts + add + subs and movs. (One multiplication is q*7 which is performed > with LEA + sub.) The algorithm that I'm suggesting, performs only one > multiplication and one. Below are the comparisons of n % 7 and the proposed > algorithm. > > https://godbolt.org/z/o7dazs4Gc > https://godbolt.org/z/zP79736WK > https://godbolt.org/z/65x7naMfq > https://godbolt.org/z/z9ofaMzex > > I hope this helps. > Cassio. > >