Hi all,

After reflecting on my previous message and Andrew's, I now believe this
patch is not the best solution to optimise the day of the week. Instead,
the optimisation for n % 7 should be done by the compiler depending on the
platform.

I'll open a missing optimisation opportunity bug report against GCC with my
suggestions.

Therefore, expecting a better solution from the compiler, I'd ask you to
disregard this patch.

Best wishes,
Cassio.

On Sun, 11 May 2025 at 01:59, Cassio Neri <cassio.n...@gmail.com> wrote:

> Thanks Andrew for your prompt reply.
>
> The results below regard my PoC which is as close to the proposed patch as
> I could make. This is because I can't have chrono with my patch on godbold
> for a comparison between current chrono and patched chrono.
>
> I tried on all platforms that I could make it to compile. Please double
> check everything because I might be misreading some results, especially on
> the platforms that I'm not familiar with. Sometimes godbold seems to have
> issues and cut pieces of the generated assembly from the output. I've
> marked these cases with (X).
>
> I suspect we want to disable this for -Os
>>
>
> Below are the sizes with -Os. Most of the time the new code is shorter
> than the old one with a few exceptions where they are the same size
> (because the platform doesn't seem to support [[assume]]). The new code is
> never longer. On each link, the middle panel shows the result for the old
> code and the right panel for the new code. These panels have tabs for
> different platforms.
>
>                 Old   New
>
> https://godbolt.org/z/hfz9szEWf
>
> x86-64         0x81  0x69
> ARM32          0x78  0x68
> ARM64          0x81  0x71
> ARM64 Morello  0x48  0x48
> HPPA           0xf8  0xc8
> KVX ACB        0xec  0xcc
> loongarch64    0x94  0x8c  (x)
>
> https://godbolt.org/z/eMfzoPhT5
>
> M68K           0xb6  0xa6
> MinGW          0xa0  0x80  (X)
> mips           0xdc  0xac
> mips64         0xcc  0xb8
> mipls64 (el)   0xbc  0xa8
> mipsel         0xe0  0xb0
> MRISC32        0xa4  0x74
> power          0xb8  0x80
>
> https://godbolt.org/z/PjqbTqK6b
>
> power64        0xa8  0x8c
> power64le      0xa4  0x88
> RISC-V (32)    0x90  0x7e  (X)
> RISC-V (64)    0x86  0x86  (X)
> s390x          0xf0  0x90
> sh             0xc2  0xb2
> SPARC          0xc0  0x98
> SPARC LEON     0xbc  0x94
>
> https://godbolt.org/z/7oebGMYTM
>
> SPARC64        0xac  0x94
> TI C6x         0xc4  0x98
> Tricore        0xb0  0xb0
> VAX            0xc8  0xc5
>
>
>> And plus i am not 100% convinced it is best for all micro-architures.
>> Especially on say aarch64.
>> Can you do more benchmarking and peocide which exaxt core is being used?
>>
>
> I don't have access to any platform other than x86-64 to do benchmarks :-(
>
> And mention the size difference too?
>>
>
> Same exercise explained above but with -O2:
>
>                  Old    New
> https://godbolt.org/z/eqGo9xnz3
>
> x86-64         0x a4  0x 72
> ARM32          0x a8  0x 74
> ARM64          0x 98  0x 80
> ARM64 Morello  0x14c  0x14c
> HPPA           0x134  0x c8
> KVX ACB        0x f4  0x 98
> loongarch64    0x ac  0x 9c  (X)
>
> https://godbolt.org/z/7qh94zGMK
>
> M68K           0x13a  0x a2
> MinGW          0x d0  0x 80  (X)
> mips           0x11c  0x e4
> mips64         0x130  0x f0
> mipls64 (el)   0x120  0x e0
> mipsel         0x120  0x e8
> MRISC32        0x a0  0x 74
> power          0x dc  0x 88
>
> https://godbolt.org/z/Y11Trnqc1
>
> power64        0x d0  0x 94
> power64le      0x d0  0x 90
> RISC-V (32)    0x bc  0x 84  (X)
> RISC-V (64)    0x be  0x 94  (X)
> s390x          0x f0  0x a8
> sh             0x c6  0x cc  (*)
> SPARC          0x108  0x 9c
> SPARC LEON     0x f4  0x 94
>
> https://godbolt.org/z/h456PTEWh
>
> SPARC64        0x c0  0x a0
> TI C6x         0x108  0x b0
> Tricore        0x e8  0x ea  (*)
> VAX            0x dc  0x dc
>
> (*) These are the only cases where the new code is larger than the old one.
>
> Plus gcc knows how to do %7 via multiplication is that being used or is it
>> due to generic x86 tuning it is using the div instruction?
>>
>
> Yes and no. In x86-64 (and probably many other platforms) the current
> optimisation for n % 7 is a byproduct of the optimisation for /, that is,
> to calculate n % 7, the generated code evaluates n - (n / 7) * 7. The
> quotient q = n / 7 is optimised to avoid div and uses a multiplication and
> other cheaper operations. In total it evaluates 2 multiplications +
> shifts + add + subs and movs. (One multiplication is q*7 which is performed
> with LEA + sub.) The algorithm that I'm suggesting, performs only one
> multiplication and one. Below are the comparisons of n % 7 and the proposed
> algorithm.
>
> https://godbolt.org/z/o7dazs4Gc
> https://godbolt.org/z/zP79736WK
> https://godbolt.org/z/65x7naMfq
> https://godbolt.org/z/z9ofaMzex
>
> I hope this helps.
> Cassio.
>
>

Reply via email to