https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91776
Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #1 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to yhr-_-yhr from comment #0)
> I'm doing this test on a Raspberry Pi Model 3B+. The CPU is BCM2835 ARMv7.

I think it's BCM2837, i.e. Cortex-A53. Or did you mean a different Pi?

> pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2
> fibmod.c
> pi@rpi:~/Desktop $ ./fibmod
> ~ 240755135 loop/s
> ~ 277965738 loop/s
> ~ 276675919 loop/s
> ~ 277244469 loop/s
> ~ 277207289 loop/s
> ~ 277303633 loop/s
> ^C
>
> (2)
> pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2
> -fsplit-paths fibmod.c
> pi@rpi:~/Desktop $ ./fibmod
> ~ 137691044 loop/s
> ~ 144593838 loop/s
> ~ 144397428 loop/s
> ~ 144519131 loop/s
> ~ 144392500 loop/s
> ^C

Can you list the assembly code for both inner loops please? This doesn't look
like a -fsplit-paths problem; it is more likely related to -mrestrict-it on
Armv8. I can reproduce a 2x slowdown with this loop if the subtract is not
conditionally executed. This happens if the register allocator uses a high
register:

fast case:

        cmp     r4, r3
        it      ls
        subls   r3, r3, r4

slow case:

        cmp     r10, r3
        bhi     .L2
        sub     r3, r3, r10
.L2:

Can you try using -mno-restrict-it on your examples and see whether that
helps?
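
For reference, a minimal sketch of the kind of C loop that contains such a compare and conditional subtract. The actual fibmod.c is not shown in this comment, so the function name fib_mod, its parameters, and the loop shape are all assumptions made for illustration. Whether the compiler emits the IT-block form or the branch form for it depends on the GCC version, the options, and register allocation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the inner loop of fibmod.c (the real source is
   not shown in this report). Returns the n-th Fibonacci number modulo m. */
static uint32_t fib_mod(uint32_t n, uint32_t m)
{
    /* m > 1 keeps the seed b = 1 below m; the upper bound keeps a + b
       from overflowing, since both a and b stay below m. */
    assert(m > 1 && m <= UINT32_MAX / 2);

    uint32_t a = 0, b = 1;
    while (n--) {
        uint32_t t = a + b;   /* t < 2 * m */
        if (t >= m)           /* the compare ... */
            t -= m;           /* ... and the subtract guarded by it */
        a = b;
        b = t;
    }
    return a;
}
```

Building this with -S and comparing the inner loop against the fast and slow sequences above is one way to check which form a given compiler chose.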