https://gcc.gnu.org/bugzilla/show_bug.cgi?id=17108
Segher Boessenkool <segher at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |segher at gcc dot gnu.org

--- Comment #8 from Segher Boessenkool <segher at gcc dot gnu.org> ---
We currently generate (for -O2 -m64; -O3 unrolls it completely, see comment 7)

        li 9,8
        mtctr 9
        .p2align 4,,15
.L2:
        stfs 1,0(3)
        addi 3,3,4
        bdnz .L2
        blr

and for -m32 we get

        li 9,8
        addi 3,3,-4
        mtctr 9
        .p2align 4,,15
.L2:
        stfsu 1,4(3)
        bdnz .L2
        blr

The difference is partly the selected -mcpu=, but that doesn't explain it
completely.

The gimple passes (probably ivopts) have decided to do a pre_inc here; all
differences are at RTL level.  Except for -mcpu=power9, where they didn't.

Even in a case where it works as expected, -O2 -m32 -mcpu=power4, the
auto_inc_dec pass does not help (this is caused by rtx_cost issues):

starting bb 3
   11: [r122:SI]=r127:SF
   11: [r122:SI]=r127:SF
found mem(11) *(r[122]+0)
   10: r122:SI=r122:SI+0x4
   10: r122:SI=r122:SI+0x4
found pre inc(10) r[122]+=4
   11: [r122:SI]=r127:SF
found mem(11) *(r[122]+0)
trying SIMPLE_PRE_INC
cost failure old=16 new=408

(I have a patch for that).
But then combine comes along and does

Trying 10 -> 11:
   10: r122:SI=r122:SI+0x4
   11: [r122:SI]=r127:SF
Successfully matched this instruction:
(parallel [
        (set (mem:SF (plus:SI (reg:SI 122 [ ivtmp.10 ])
                    (const_int 4 [0x4])) [1 MEM[base: _17, offset: 0B]+0 S4 A32])
            (reg/v:SF 127 [ d ]))
        (set (reg:SI 122 [ ivtmp.10 ])
            (plus:SI (reg:SI 122 [ ivtmp.10 ])
                (const_int 4 [0x4])))
    ])
allowing combination of insns 10 and 11
original costs 4 + 4 = 8
replacement cost 4

-m64 however says

Trying 10 -> 11:
   10: r122:DI=r122:DI+0x4
   11: [r122:DI]=r127:SF
Failed to match this instruction:
(parallel [
        (set (mem:SF (plus:DI (reg:DI 122 [ ivtmp.11 ])
                    (const_int 4 [0x4])) [1 MEM[base: _17, offset: 0B]+0 S4 A32])
            (reg/v:SF 127 [ d ]))
        (set (reg:DI 122 [ ivtmp.11 ])
            (plus:DI (reg:DI 122 [ ivtmp.11 ])
                (const_int 4 [0x4])))
    ])

Oh dear, we do not have the float load/store-with-update instructions for
-m64.  On all modern 64-bit CPUs these are cracked, so they execute the same
as the separate addi and store instructions, but it costs code space.  And if
we do not want them we should make them more expensive, not just pretend the
insns do not exist :-)
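
For anyone who wants to look at the dumps themselves: a loop of roughly the
following shape should give code like the above.  This is only a sketch
reconstructed from the assembly (eight iterations, one float store per
iteration, the stored value named d in the RTL); it is not necessarily the
testcase attached to this bug, and the names fill8 and p are made up here.

```c
/* Hypothetical reduced loop, reconstructed from the assembly in this
   comment.  Only the name "d" comes from the RTL dumps; "fill8" and "p"
   are assumptions made for illustration.  */
void
fill8 (float *p, float d)
{
  long i;

  /* Eight float stores through an advancing address.  The store is the
     candidate for the update form (stfsu) rather than stfs + addi.  */
  for (i = 0; i < 8; i++)
    p[i] = d;
}
```

Compiling this for a PowerPC target with -O2 and either -m32 or -m64 (plus
-fdump-rtl-auto_inc_dec and -fdump-rtl-combine) should show whether the store
ends up as stfsu or as stfs followed by addi.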