https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62631
--- Comment #18 from amker at gcc dot gnu.org ---
(In reply to Eric Botcazou from comment #16)
> > The cost of expression "p + ((sizetype)(99 - i_6(D)) + 1) * 4" computed
> > using normal +/-/* operators on sparc64 is 18, but the cost is 32 if it is
> > computed as "p + ((sizetype)(99 - i_6(D)) + 1) << 2", which is returned by
> > get_shiftadd_cost.
>
> How do you get the first number exactly?  Note that the costs of shiftadd is

In force_expr_to_var_cost, it calculates the cost in the normal way before
returning get_shiftadd_cost.

> completely skewed (by a factor of 3) because expmed.c computes it as a
> multadd instead of a shiftadd:
>
> Breakpoint 2, init_expmed_one_mode (all=0x7fffffffd540, mode=QImode, speed=1)
>     at /home/eric/svn/gcc/gcc/expmed.c:219
> 219               set_shiftadd_cost (speed, mode, m, set_src_cost
>                       (all->shift_add, speed));
> (gdb) p debug_rtx(all->shift_add)
> (plus:QI (mult:QI (reg:QI 109 [0])
>         (const_int 2 [0x2]))
>     (reg:QI 109 [0]))
>
> but this should ensure that the costs are roughly the same for the
> expressions.
>
> > From the assembly code, it seems the computation is expensive on sparc64, I
> > may skip the test for these architectures if no other solutions.
>
> The hitch is that the code generated for 32-bit SPARC (where the test
> passes) is the optimal one and is also valid for 64-bit SPARC.

The assembly on sparc64 is as below:

f1:
	.register	%g2, #scratch
	sllx	%o1, 2, %g1
	mov	99, %g2
	add	%o0, %g1, %o0
	sub	%g2, %o1, %o1
	srl	%o1, 0, %g1
	add	%g1, 1, %g1
	sllx	%g1, 2, %g1
	add	%o0, %g1, %g1
	st	%g0, [%o0]
.LL5:
	add	%o0, 4, %o0
	cmp	%o0, %g1
	blu,a,pt %xcc, .LL5
	 st	%g0, [%o0]
	jmp	%o7+8
	 nop

While it is more efficient on sparc32, as below:

f1:
	sll	%o1, 2, %g1
	sub	%g0, %o1, %o1
	add	%o0, %g1, %o0
	sll	%o1, 2, %o1
	add	%o1, 400, %g1
	add	%o0, %g1, %g1
	st	%g0, [%o0]
.LL5:
	add	%o0, 4, %o0
	cmp	%o0, %g1
	blu,a	.LL5
	 st	%g0, [%o0]
	jmp	%o7+8
	 nop

The bloated pre-header happens on all 64-bit platforms.  At least I can
confirm that on aarch64 it is much worse than on arm.
The difference is that on aarch64 it is fixed in later compilation passes (I didn't investigate why or how).  I think the cause is that, on 64-bit platforms, the expression "p + ((sizetype)(99 - i_6(D)) + 1) * 4" is not equivalent to "p + (sizetype)(100 - i_6(D)) * 4" as it is on 32-bit platforms, because sizetype has larger precision than the 32-bit subtraction.