https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91776

Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #1 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to yhr-_-yhr from comment #0)
> I'm doing this test on a Raspberry Pi Model 3B+. The CPU is BCM2835 ARMv7.

I think it's BCM2837, i.e. Cortex-A53. Or did you mean a different Pi?

> pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2 
> fibmod.c 
> pi@rpi:~/Desktop $ ./fibmod
> ~ 240755135 loop/s
> ~ 277965738 loop/s
> ~ 276675919 loop/s
> ~ 277244469 loop/s
> ~ 277207289 loop/s
> ~ 277303633 loop/s
> ^C
> 
> (2)
> pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2
> -fsplit-paths fibmod.c 
> pi@rpi:~/Desktop $ ./fibmod
> ~ 137691044 loop/s
> ~ 144593838 loop/s
> ~ 144397428 loop/s
> ~ 144519131 loop/s
> ~ 144392500 loop/s
> ^C

Can you post the assembly for both inner loops, please? This doesn't look
like an -fsplit-paths issue; it is more likely related to -mrestrict-it,
which is on by default for Armv8 in Thumb mode. I can reproduce a 2x slowdown
with this loop when the subtract is not conditionally executed. That happens
if the register allocator picks a high register, because the subtract then
needs a 32-bit encoding, which a restricted IT block cannot contain:

fast case:
        cmp     r4, r3
        it      ls
        subls   r3, r3, r4

slow case:
        cmp     r10, r3
        bhi     .L2
        sub     r3, r3, r10
.L2:

Can you try building your examples with -mno-restrict-it and see whether
that helps?
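
For reference, a minimal C sketch of the kind of loop that produces the
compare-and-subtract pattern above. fibmod.c itself is not shown in this
thread, so the function name, signature and the assumption that the inner
loop is a Fibonacci step reduced modulo m are all hypothetical:

```c
#include <stdint.h>

/* Hypothetical stand-in for the inner loop of fibmod.c (the real source is
   not shown in this thread). Computes F(n) mod m using a compare followed
   by a conditional subtract instead of a division. Assumes 0 < m < 2^31 so
   that a + b cannot overflow. */
uint32_t fib_mod(uint32_t n, uint32_t m)
{
    uint32_t a = 0;
    uint32_t b = 1 % m;

    for (uint32_t i = 0; i < n; i++) {
        uint32_t t = a + b;   /* a, b < m, hence t < 2 * m */
        if (t >= m)           /* corresponds to: cmp m, t */
            t -= m;           /* either a predicated sub or a branch around sub */
        a = b;
        b = t;
    }
    return a;                 /* after n iterations a holds F(n) mod m */
}
```

Whether the `if` becomes a predicated subtract or a branch depends on the
compiler version, the options and the registers chosen, so comparing the
`gcc -S` output of both builds is the way to confirm which case applies.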
